CIS*2650 Programming IIWinter, 2004

Assignment 2

Due Date: 10am, February 9th , Monday, 2004

Deliverables

* A2.tar

* makefile

Objectives

* Working with multiple source files

* Scanning and parsing

* Character and string manipulation

* Tar and makefile

Problem Statement

A compiler like gcc, cc or javac is basically a translator, which reads a program written in one programming language, the source language, and translates it into an equivalent program in another language, the target language, usually (but not necessarily) from a high-level language like C and Java to a low-level language like machine language. As an important part of this translation process, the compiler reports to its user the presence of errors (if any) in the source program. The whole process consists of four phases.

Lexical Analysis

In a compiler, lexical analysis (a.k.a. scanning) reads the stream of characters making up the source program and groups them into tokens that are sequences of characters having a collective meaning. Think of the process as spell-checking. For example, the assignment statement

final_mark = 90 ;

would be grouped into the following tokens:

1. The variable identifier final_mark

2. The assignment symbol =

3. The integer constant 90

4. The semicolon punctuation ;

The blanks separating the tokens would normally be eliminated or ignored. The program that conducts lexical analysis is called the scanner. Any string that cannot be recognized by the scanner would be reported to the user as an error. For example, if we changed 90 to 90k, which is not an integer, nor a valid identifier, an error message, like "Error: invalid number <90K>", would be printed on screen.

Syntax Analysis

If all tokens in a statement check out ok, the next step is to group the tokens into grammatical phrases (like grouping words into sentences in natural language). For example, in C, the assignment statement starts with an identifier followed by the assignment symbol and a constant, ended with a semicolon. According to this definition, the statement above is a valid assignment statement.

But if somehow it is changed to final_mark 90 =;, a syntax error should be reported since it violates the C specification for the assignment statement. The program that conducts syntax analysis is called the parser.

Semantic Analysis

If a statement is grammatically correct, the next step is to check whether it makes sense. The sentence "The universe circles around the earth." is lexically and syntactically correct, but not semantically. For the assignment statement

final_mark = 90;

at least two things need to be checked:

1. Has variable final_mark been declared previously?

2. Is final_mark an integer-compatible variable (char, integer or float)?

Any answer of no to either question will cause a semantic error.

Semantic Translation

If a statement passes all three checkpoints, the last step is to execute it. In the above example, the value of variable final_mark will be updated to 90. Errors (referred to as runtime errors) might still occur at this stage, such as access violation, underflow/overflow, divided by ZERO etc.

The Assignment

For this assignment you are asked to implement a scanner and a parser, which work collaboratively to detect whether an input statement (one line per input) is a valid expression. The two will interact with each other through shared functions, shared constants, and/or shared variables, which are defined in their header files. Review Lab #3 if necessary.

Try this example on this demo (right click, save link as) to see what is expected. It can only be run on Linux by typing in acc (or whatever file name you saved it as). You need to change the file attribute before running it. On a terminal, enter chmod +x acc (or whatever file name you saved it as). The program runs forever until it is manually terminated by the user by pressing Control + C. The demo only gives a general idea of the requirements. Your program does not need to be exactly like that. If you can make yours better, by all means.

Details and mark distribution (25 plus 5 bonus) are explained below.

Question #1 Scanner (12 marks)

The scanner program, scanner.c, must implement at least one function:

int GetToken ( char* token )

which reads one token from the input and returns the toke attribute (i.e. what kind of token it is: an operator, variable, integer, the new line character, or unknown?) to the caller (the parser in this case). The token string will be stored in variable token.

For example, assume GetToken reads "final_mark" from the input, which is a variable, then token = "final_mark", and VARIABLE will be returned to the caller indicating that it is indeed an variable, where VARIABLE is an integer constant predefined by #define in the parser header file, and shared by the scanner and the parser.

In another example, assume GetToken reads "90" from the input, which is an integer, then token = "90", and INTEGER will be returned to the caller. Again, INTEGER is an integer constant predefined by #define in the parser header file, and shared by the scanner and the parser.

Tokens

The following are valid tokens:

* variable

* operator like "+", "-", "*", "/"

* integer

* the new line character '\n', indicating the end of the current input

Anything (except blanks) that cannot be identified as above will be returned as UNKNOWN to the caller. To make your life a little bit easier, assume tokens are separated by space(s). Therefore, "a+b" will be considered as one token, which of course is invalid (UNKNOWN), while "a - 10" with spaces in-between will be considered as three tokens: variable "a", subtraction operator "-", and integer "10".

Integer

This assignment only considers integers, which could be positive or negative. "0123", "+0123" and "-123" are all valid integers.

Variable

Like in C, a variable name must be a valid identifier which is a sequence of characters consisting of underscores ("_"), letters and digits that does not begin with a digit. Each identifier is no more than 20 characters long and case matters. So a1 and A1 are different identifiers.

The essence of the scanner can be summarized in two questions:

1. How to tokenize an input? That is, if "abc - 9" is the input, how to retrieve string "abc", "-" and "9" one by one?

2. Once a string (token) is retracted successfully, what does this string represent? A variable, an operator, an integer, or something unknown?

Question #2 Parser (8 marks)

The parser (parser.c) calls the GetToken function defined in the scanner to obtain tokens, and groups them into an expression, or reports error if something doesn't fit syntactically. Expression is defined as following:

1. Any identifier is an expression;

2. Any integer is an expression;

3. if expr1 and expr2 are expressions, so are

* expre1 + expre2

* expre1 - expre2

* expre1 / expre2

* expre1 * expre2

Here is how the parser should work:

1. Ask the user to input one statement on one line;

2. Repeatedly call the GetToken function in the scanner to get the tokens;

3. Check the appearing order of the tokens against the expression rules given above;

4. If the sequence of the tokens falls under one of the expression categories when the parser encounters the new line character '\n', the input is a valid expression and an "Expression accepted" message should be printed out; otherwise, report error message.

5. Repeat step 1 (the program runs forever until being terminated manually by pressing Ctrl+c).

The best approach to describe the behavior of the parser is to use a finite state machine (FSM), which will be discussed in Lab #3.

Question #3 Error Reporting (5 marks)

Coming up with a meaningful and accurate error reporting mechanism is one of the most challenging and important tasks in compiler construction. Anything that doesn't fit into the prescription of the expression syntax should be reported as a syntax error. But a general "syntax/parser error" message doesn't help much to locate the cause. Your program should at least be able to catch the following errors:

* Missing operand (like + a / b)

* Invalid identifier (like ^a - 10)

* Missing operator (like a b * c)

* Invalid number (like 10k + c)

Do your best to pinpoint the error and the cause. Feel free to change the error message to make it more meaningful. Try these "bad" examples on the demo to see what it is capable of.

Question #4 Surprise Me (Maximal 5 bonus)

You are not condemned to any specifics for this question. Use your own creativity and imagination to enrich your program. the following are some ideas you might consider.

Idea #1: The above questions deal with integers only, which is not practical (well the whole thing is impractical but at least we can make an effort). Modify your program to handle float operand as well. (3 bonus)

Idea #2: A more general form of expression supports brackets as well, like ( a + b ) / ( c * ( d + a ) ). Extend your program to support this enhanced expression specification. (5 bonus)

Idea #3: As stated early, coming up with a meaningful and accurate error reporting mechanism is one of the most challenging tasks in compiler construction. For example, "a+10" would be reported as an "invalid token" in the given demo. But a more likely scenario is that the user forgot to type in spaces between tokens. So a more accurate message should be something like "spaces expected between tokens". Such approach involves guessing and prediction. Modify your program to enhance your error reporting ability. (3 bonus)

Idea #4: A well-designed scanning method can eliminate the necessity to use space(s) to separate tokens. Rewrite your scanner so that there are spaces between tokens or not won't matter. (5 bonus)

These are merely suggestions and you don't have to follow them. You are encouraged to create your own problem(s).

Hand-in Procedure (Important)

Your assignment should consist of the following files: scanner.c, scanner.h, parser.c, parser.h and readme.txt. Storing all programs in one C file subjects to 20% penalty of the total. File readme.txt should contain brief descriptions of your program, including:

* What questions you have done

* Explain how to use your program in a nutshell

* If you have answered Question #4, describe what you did

* Limitation or special features that worth mentioned

* How many hours you spent on this assignment

You should archive all relevant files into a tar-ball, A2.tar, by using the tar command on Linux:

tar -cvf A2.tar scanner.c scanner.h parser.c parser.h readme.txt

You are also responsible for writing a makefile, which, when executed, will un-archive the tar-ball, compile all C files, and link them into one executable acc. Fail to submit the tar file or the makefile subjects to 20% penalty of the total. Not including the readme.txt file also subjects to 10% deduction of the total. More information about the tar command can be found on page 125, Chapter 12.3.3 in the official Redhat Linux manual, or in the Lab Materials folder on WebCT.

You must hand in A2.tar and the makefile through WebCT:

* To upload completed assignments, click Assignment and lab test submission. The Assignments screen appears.

* Under Title, click the assignment that you want to submit. The Assignment screen appears.

* Click Student files. The Student Files for Assignment screen appears.

* Click Upload. The Upload File for Assignment screen appears.

* To locate the file, click Browse. Your computer's file manager appears.

* Select the file. The Upload File For Assignment screen reappears, with the name of the file in the Filename text box.

* Click Upload. The Student Files for Assignment screen appears.

* Click the Return to Assignment link at the top of the screen. The Assignment screen appears.

* Click Submit assignment. The Submit Assignment screen appears. You can receive email notification that your assignment was submitted successfully. In the text box, type your email address.

* Click Submit assignment. A confirmation message appears.

* Click OK. The Assignments screen appears with the message Submitted appearing in the Status column. When the assignment has been graded, this message changes to a Graded link. Click the Graded link to view your grade and any instructor comments.

** There is a very common mistake that students upload their files and forget to hit the submission button. Remember, this is a two-step procedure: upload your file(s) first then submit. If you have more than one files, you need to upload several times, but only submit once.

Your program will be marked based on style and correctness. Style includes proper indentation, meaningful names and documentation (explanation of the functionality of each function implemented, the meaning of important variables etc). The maximal deduction for lacking or poor style is 20% of the total grade you earn. For example, a student did Q1 and Q2 but with poor indentation and no comment. His/her final grade for this assignment would be 20-20*20%=16.

If you develop your program on a different platform (i.e. different system and tool), make sure it is compatible with Redhat Linux 9.0 and gcc 3.2, which are used to compile, execute and mark your assignment. An assignment that cannot be compiled or is not submitted properly receives a grade of ZERO. If you have any questions about the hand-in procedure, please ask the TA well before the due date.

Piece of Advice

Attend the lab !!! This assignment involves lots of technical details not being taught in class. How to share functions and variables among different programs and how to link several program modules into one executable are some of the critical issues to be addressed in the lab.

Examples

/* don't the forget the space separating each token

test1 / 010

A + +19 * _tax

_b - -1

Bad Examples

//#1: missing operator

abc 10 +

//#2: invalid variable name

a$c + 10 ;

//#3: invalid start of an expression

+ - ab -10 ;

//#4: missing space separation treated as one bad token

a+1;

//#5: invalid integer

a * 10d ;