The Original Publication of This Document Was In

READERS ARE CAUTIONED THAT THIS DOCUMENT MAY CONTAIN SOURCE CODE IN THE FORM OF PATCHES OR CODING EXAMPLES.

DESPITE USE OF THE BEST AVAILABLE OPTICAL CHARACTER RECOGNITION EQUIPMENT, AND VISUAL COMPARISON OF THE RESULTS WITH THE ORIGINAL AFTERWARD, THE INTEGRITY AND CORRECTNESS OF ANY CODE SEGMENTS IN THIS DIGITIZED DOCUMENT CANNOT BE GUARANTEED.

AS WAS TRUE IN THE ORIGINAL, NO RESPONSIBILITY IS ASSUMED BY ANYONE IF YOU CHOOSE TO USE THE MATERIAL IN THIS DOCUMENT.

______

THE STANDARD TRUETYPE FONTS “ARIAL,” “TIMES NEW ROMAN,” AND “COURIER NEW” HAVE BEEN USED WHERE POSSIBLE. BE AWARE THAT FONT SUBSTITUTION MAY CAUSE THE RESULTING DOCUMENT TO BE MISALIGNED.

THE ORIGINAL PUBLICATION OF THIS DOCUMENT WAS IN:

Proceedings of the Digital Equipment Computer Users Society, Anaheim, California - December 1982, pp. 401 - 413.

THIS DOCUMENT WAS DIGITIZED AND CONVERTED TO MS WORD FORMAT BY MACHINE INTELLIGENCE. OUR BUSINESS IS RSX, PDP-11, VAX, VMS AND DEC SOFTWARE, SYSTEMS ANALYSIS AND CONSULTING.

VISIT OUR WEB SITE AT FOR MORE RSX AND DEC INFORMATION.

Digitized from the original 2 August 2001 by Machine Intelligence, RSX Software and Consulting -

State Table Parsing for Large Grammars

Bruce R. Mitchell

PO Box 601

Hudson, WI 54016

ABSTRACT

When it is necessary to parse a large grammar, such as a grammar for a CLI, it is necessary to use techniques which allow for complex and expandable grammars. One such technique is state table parsing, which an associated "dictionary" for lookup of grammar elements; this technique is both flexible and easily expanded. This paper discusses and demonstrates techniques used in state table driven parsing.

INTRODUCTION

In this day of inexpensive, intelligent video terminals and menu driven software, the techniques of lexical analysis and command line parsing have fallen into disuse. The current attitude toward operator input seems to be that it should be made "idiot proof" through the use of video screens and bounded inputs, and that English language input to programs is a limited and obsolete approach.

This is an unfortunate situation, because the techniques of lexical analysis and parsing are useful for much more than simply accepting and processing command lines to programs. The basis of all compilers, assemblers, command line interpreters and database maintenance programs is a parser which takes an input (often a freeform input) and converts it to an internal format which is usable by the processing section of the program. While a screen driven assembler is certainly feasible, it is just as certainly not desirable!

Where does parsing find its best applications, then? For the classical applications of parsing - analysis of input lines - it is best used in applications where the input is freeform, structured according to some set of grammatical rules, and the desired output can be expressed as one of a number of unique cases of combinations of inputs.

There are many methods of parsing input; some are codeintensive, some are dataintensive. Some are quite simple and others are quite complex. The most popular method of parsing is statetable driven parsing; it is dataintensive and very efficient in terms of code. To understand the basis of statetable driven parsing, though, it is necessary to work through the basics of freeform input and simpler methods of parsing. These basic principles begin with obtaining and conditioning an input line.

OBTAINING AN INPUT LINE

The first thing that should be done when parsing is to get some input from the user. Input is very important to a program; without it the program becomes very unhappy. Our programs should all be happy programs, so provision should be made for the user to provide input.

A typical input line to a dungeon parser might look something like this:

> Drive Batmobile over cliff

Well, perhaps it's not all that typical. However, it can be seen that input lines are usually short, and are certainly only one line long. With this in mind, all that is necessary at this stage is to get an input line containing something the user thinks the program will recognize.

Prompting

Prompting is simply printing a special character at the beginning of a line to notify the user that input is expected.

Prompts take several forms, but two popular ones are question mark and space, and right angle bracket and space. When the user sees either of these prompts at the start of a new line, he knows that he may enter a new command. On a terminal, the prompting characters and user input lines might look like this:

$ Attack troll using sword

> Attack troll using sword

If you feel particularly expansive about prompting, it is possible to go one step further, and use the RSX readafterprompt QIO. This is nice for the user, as he can then tell the difference between a system prompt and a program prompt. This type of prompt and an input line might look like this:

DUN> Attack troll using sword

Reading And Storing the Input Line

The program, after prompting for input, should read an input line from the terminal. Depending on the language and how low level the program is going to be, this may be quite simple or very tedious. However it is done, there are common points in all readandstore routines.

The input line must be read as single characters and stored the same way. (This is not true for languages which can handle words as character strings.) FORTRAN users should read the input line as characters, using an nA1 format, and store it in a character (byte) array. BASIC users should use LINPUT or the equivalent to insure that they get the whole line (not just the line up to a delimiter), and store the line in a string. All other languages should read a line of characters and store an array of characters.

The input line should be stored exactly as read, with the exception that some systems and OTSs allow or force the conversion of lower to upper case on input. This is tine, because it must be done next in any case. Other than this, no processing of the input line should be done.

CONDITIONING THE INPUT LINE

The program now has an input line stored and is ready to start processing it. From this point until the next prompt for input is issued, the user is out of the picture; nothing he does can affect what happens as a result of his command.

The raw input line as stored by the program may or may not be acceptable to the routines which follow. A good deal of input conditioning is needed if the parsing routines are to be as simple as possible; that is, the program should only have to deal with a limited number of possible things on the input line. Therefore, some conditioning of the input line is in order.

Some operating systems or languages can do much of this work for the program, if the language OTS or operating system is asked to do it. Lower to upper case conversion comes immediately to mind in this category of processing functions. Some higherlevel languages can even do word breakout for a program. Bear in mind, then, that some things discussed here are not applicable to all systems, or may be trivial to do given the capabilities of a chosen language and operating system.

Deleting Special Characters

Most systems trap out special and control characters below ASCII 32, but some don't. There is generally no use for these special characters, so if they aren't trapped, the input array must be scanned for these characters and an error message printed if any show up. This detection is quite simple in most languages, and (if necessary) should be done before any lower to upper case conversion due to the method which is presented for lower to upper case conversion. For demonstration purposes, here is a metalanguage example of checking for special characters:

POS = 1

BADCH = 0

WHILE ((POS < ENDPOS) AND (BADCH = 0))

IF (IN(POS) < 32) BADCH = POS

POS = POS + 1

END WHILE

When the program drops out of this examination loop, if the BADCH variable is anything but 0, an error message should be printed identifying the bad character, and the program should return to the prompting and input reading section.

Note that the metacode presented here examines the line only until the first bad character is seen. There may be more than one error in the input line, but one is enough to stop processing of the line. This saves some code and forces the user to correct the input line an error at a time if multiple errors are present.

Lower to Upper Case Conversion

For those who have a system with upper and lower case capabilities, a problem arises immediately. The parser dictionary (to be discussed later) cannot possibly store all combinations of upper and lower case letters a user might type in for each word. It is necessary to convert the input line to all upper (or all lower) case letters.

This processing is simply done if the chosen language has the capability of comparing characters and/or treating characters as numbers. Because the ASCII codes for uppercase A to Z are separated from the ASCII codes for lowercase a to z by 32 decimal (uppercase A is decimal 65, lowercase a is decimal 97,) it is only necessary to subtract 32 decimal from the numeric representation of the lowercase letters to change them to uppercase. (This is a valid operation, because the special characters have already been detected and trapped.)

Here is some example metalanguage code to demonstrate this type of conversion:

POS = 1

WHILE (POS < ENDPOS)

IF (IN(POS) > 95)

IN(POS) = IN(POS) 32

END IF

POS = POS + 1

END WHILE

Some clever programmers out there are now saying: "Aha! There is more above ASCII 95 than just the lowercase letters! This doesn't work!" That's true; there is more above ASCII 95 than the lowercase letters. There are six other characters, which are the backquote `, left and right setbuilder brackets { and }, the vertical bar |, the tilde ~, and DELETE.

Let's see what happens if this method is applied to those characters. The DELETE character is generally used to rub out a character, so it never appears in the input. The remaining 5 characters have no use in most programs, so if they are ignored, they convert to the at sign @, left and right square brackets [ and ], backslash \, and up caret ^. Those characters are probably not useful either and most likely will be thrown out later, so this method of changing lower to upper case DOES work.

Detecting Unwanted Characters

In general, a program accepting command lines needs to recognize only characters A through Z, 0 through 9, and space. Everything else is generally superfluous. If it is assumed that the input line is already converted from lower to upper case, and the line is free of special characters, the problem is now detection of anything other than the characters desired. Again, if the input line is stored as ASCII and the chosen language can handle ASCII characters as numerics, this is no problem.

The characters that might be in a good input line are ASCII 32 (space), ASCII 48 through 57 (0 through 9), and ASCII 65 through 90 (A through Z). Here is an example of metacode to detect invalid characters:

POS = 1

BADCH = 0

WHILE ((POS < ENDPOS) AND (BADCH = 0))

IF ((MN(POS) > 32) AND

((MN(POS) < 48) OR (IN(POS) > 57)) AND

((MN(POS) < 65) OR (IN(POS) > 90)))

BADCH = POS

END IF

POS = POS + 1

END WHILE

The long IF statement looks peculiar, but it works. Anything other than the desired space, numerals, or uppercase letters is detected by this code. Again, if BADCH is anything but 0, it is pointing at the first invalid character on the input line.

It is obvious that three separate loops to check each case of bad input characters are not necessary; a single one suffices, with the checking and conversion done in the order given here. Control characters can be trapped, lowercase characters can be converted, and invalid characters can be trapped all inside a single loop.

With the input line conditioned to reflect the needs of the parsing routines, let us proceed to the next stage of parsing, which is conversion of the input line into words.

THE DICTIONARY

Before writing code to break the input line up into words, the implementation of the program's vocabulary, or dictionary, should be discussed.

Every program which accepts command lines must have a limited and structured syntax. Due to this limited syntax, it sometimes happens that under the best circumstances, the most (apparently) logical command may return to the user the response - "I don't know that word", or something to that effect.

This problem is an outgrowth of the fact that storing character strings is a memoryexpensive thing to do. An average sized dungeon program has a 200 to 400 word vocabulary; a large, complex dungeon might have a larger vocabulary. Compared to the 800 word Basic English vocabulary - a very limited subset of English - a dungeon program seems onefourth to onehalf as intelligent as the Basic English user.

This problem is not as bad as it seems, for two reasons. The first is that most programs are limitedvocabulary intensive; that is, a limited number of words are used frequently. Consider: With 15 words, basic dungeon functionality can be implemented. Which 15 words? These: Drop, Take, Examine, Quit, Score, Up, Down, North, Northeast, East, Southeast, South, Southwest, West, Northwest. These take in most cases that dungeon players find; moving, taking something, dropping something, examining something, and finding his score. While it wouldn't be inspiring, a dungeon could be written using only these fifteen words.

The second reason that a limited vocabulary is not necessarily bad is this: The more complex a vocabulary becomes, the more difficult it is to write routines to understand it.

When the vocabulary is expanded to include many words, and words which permit complex cases, the parsing routines and database rapidly grow to take up a prohibitive amount of space. There is much good to be said for a simple, limited sentence structure; one of the things particularly good about it is that the user does not have to wonder what strange command is necessary to effect the desired action.

With this in mind, what is necessary to implement an effective dictionary? Surprisingly, only two things: (1) A list of words. (2) A list of characteristics of the words. Each topic is now dealt with individually.

The Word List

The dictionary, per se, is the list of words which the parsing routines recognize as valid. This list of words is an important part of the program; the structure of the program and the methods of use revolve around this word list.

If a word is not in the list which should be there, the program has a hole which is irritating to users and will eventually come back to haunt the author in the form of a patched version of the program which must be distributed to all users.

On the other hand, if the word list is huge, encompassing every possible word and synonym for the program's objects and actions, then the parsing routines become complex, the word list takes up space which can be better used by code, and most of the list is unused. The goal of creating an effective word list, then, is to have a list which encompasses all necessary words and all words which are potentially useful, but no words which are unnecessary.

To understate the problem, this can be difficult to accomplish. A good basic vocabulary for a CLI can be implemented in 100 to 200 words, but they must be chosen with care. It is often found that there are problems with synonyms to be resolved for example, a "program" can also be "code" or "source". How much redundancy will be allowed in the application must be decided.

The solution to these problems, one and all, is to maintain the vocabulary as a file which is read in by the program at startup time. This allows quick changes to the vocabulary when they are desired.

Word Characteristics

In most command grammars, there are some elements which are useful for a particular purpose, and are good for no other purpose. The same is true of the words in the dictionary; they will fit in certain places in a command, and will not fit in others.

Since grammar is an uninteresting subject, it is best approached from a practical viewpoint. The functionality of the words is what we want to know, and (for a dungeon) this functionality easily breaks down into 15 major categories:

1. Null word - such as A, AN or THE

2. System command - such as DIAGNOSE or SCORE

3. Take command - such as TAKE or GRAB

4. Move command - such as RUN or WALK

5. Verb - such as ATTACK or TURN

6. Object - such as GEMS or VASE

7. Inhabitant - such as THIEF or DWARF

8. Adjunct - such as USING, WITH or UNDER

9. Direction

10. Magic word - such as SHAZAM

11. Cuss words

12. DROP command

13. EXAMINE command

14. READ command

15. End of line - a marker to show end of line

It may be noted that "end of line" is not really a word. It is necessary for state table parsing, though, so it is included in the list. Drop, examine and read commands can be implemented as general verbs, but are more conveniently treated if entered as a separate word class. Please note that this list is specific to a dungeon program, and that each application is different in classification of words in the dictionary.