CSCI 431 Assignment One: The LegoLogo Lexical Analyzer (LLLex)

History

Many attempts have been made to create programming languages which are intuitive and easy to learn. One of the best of these was LOGO which allowed children as young as 3 to learn a computer language. This language involved a “turtle” which could be driven around the screen using simple instructions. The turtle, when viewed from above, was represented by a triangle.

Overview

Over the course your first three assignments in this class, you will write a compiler to support the programming of a Lego MindStorm robot in a LOGO-like language that we will call LegoLogo. The first part of this project is to write the lexical analyzer (LLLex) that will be used by your compiler. The job of the lexical analyzer is to convert a LegoLogo program into a stream of tokens.

LegoLogo Language Specification

The grammar of LegoLogo in EBNF notation is as follows:

<program> ::= <statement list> eof

<statement list> ::= <statement> <statement list> | e

<statement> ::= <turtle action> | <repeat statement>

| <variable declaration> | <variable operation>

<turtle action> ::= FD <varnum>

| BW <varnum>

| LT <varnum>

| RT <varnum>

| BEEP <note> <varnum>

<repeat statement> ::= REPEAT <varnum> [ <statement list> ]

<variable declaration> ::= VAR <identifier> <varnum>

<variable operation> ::= * <identifier> <varnum>

| + <identifier> <varnum>

| - <identifier> <varnum>

| / <identifier> <varnum>

<varnum> ::= <identifier> | <integer>

<note> ::= DO | RA | ME | FA | SO | LA | TE

The terminal eof represents an end of file character and e represents the empty string. The two non-terminals identifier and integer are generated by the regular expressions below.

Other Defined Tokens

<identifier> ::= <letter> { <letter> | <digit> | _ } *

<integer> ::= <digit> { <digit> } *

where letter = [a-zA-Z] and digit = [0-9].

Reserved Words

FD / BW / RT / LT / BEEP / REPEAT / VAR / DO / RA / ME / FA / SO / LA / TE

Special Symbols

+ / - / * / / / [ / ]

White Space

·  White space includes blanks, newlines, and tabs.

·  White space is ignored except where it must separate IDs, NUMS, keywords and special symbols.

Assignment?

You’ve probably noticed that there is no assignment operator. In this language, assignment is implicit when a variable is declared, and as a the result of an arithmetic operation. Each arithmetic operator receives two parameters, a variable name and an argument. The result of the operation is stored in the variable. For instance,

VAR x 10

+ x 5

makes the values of x equal to 15.

Example Programs

Just to give you a feel for the language, here are few example programs. In these illustrations, only the path of the robot is depicted.

Example 1

FD 30

LT 45

FD 30

LT 45

FD 30

LT 45

FD 30

LT 45

FD 30

LT 45

FD 30

LT 45

FD 30

LT 45

Adding Loops

REPEAT 8 [

FD 30

LT 45

]

Using Variables

VAR A 1

REPEAT 100 [

VAR C A

* C 2

FD C

RT 62

+ A 1

]

Nested Loops

VAR A 1

REPEAT 50 [

FD A

RT 30

VAR B 1

REPEAT 8 [

VAR C A

/ C 5

FD C

RT 45

+ B 1

]

+ A 1

]

Your Assignment: LLLex

For this assignment you will write a lexical analyzer, a scanner, that takes a text file of characters and produces a stream of tokens to be passed along to the parser part of a compiler's front end. Tokens are an internal data type of some sort, containing the information that might be needed about each token. You should also create a "symbol table" to hold information about identifiers: at a minimum, the symbol table will be used to store the values of the variables.

Your compiler must accept a single argument which is the name of the source code file that is to be translated. The lexical analyzer will need to receive the file name from the compiler. The scanner should have a “nextToken” method that returns a description of the next token in the file each time it is called. Perhaps the most expedient approach to designing the scanner is to read all tokens from the source code file into an array, and then write a “nextToken” method that can be used to return one token description at a time in the order that the tokens appeared in the original file.

The lexical analyzer should produce no output of its own. Create a short "driver" program that calls the “nextToken” method repeatedly and prints the tokens returned to the monitor.

There are lexical errors that might occur: illegal symbols in the file, for instance. You must choose how to handle these errors. One way would be for the scanner itself to generate error messages. An alternative would be for the scanner to call a separate error handler. A third would be for the scanner to return a special error token and let the main program/parser handle the error message.

The lexical analyzer must be written in Perl (see resources below).

Resources

Perl should be installed on all Linux boxes in RBH 004. It is also freely available for all platforms: Perl can be found at http://www.perl.com/. There is also a nice IDE for Perl called Komodo on the linux boxes in RBH004. I am told that to use Komodo in RBH 004 you need to add the following line to your .cshrc file:

setenv ACTIVESTATE_LICENSE /usr/local/Komodo-2.3/.ActiveState/ActiveState.lic

A scanner/parser very similar to the one you must write was given as an assignment last year at UNC Chapel Hill. Both the assignment and the solution (written in Python) are available on-line.

What to Turn In

Transfer all source code to your class ftp directory. (Contact me if you don’t know your ftp directory name.) Be sure that your code is adequately commented. This assignment is due by midnight of Sept 12th.