Assignment 1 FAQs

  1. Can I use Regex to recognise tokens?

No. We are developing a scanner to be used in a compiler to compile a substantial language, VC. A scanner is not written to call another scanner to do the job.

  2. How are invisible characters handled?

It depends on where they appear. Inside a string, an invisible character is handled in the same way as a normal character. Anywhere else, an error token is returned. See the answer to Question 7.

  3. Once an "unterminated comment" error is detected and reported, should my scanner resume and continue to work on the remaining input?

This is entirely up to you. There will be simple test cases designed to check if your scanner can detect "unterminated comment" errors. Once such an error is detected and reported, your scanner can simply stop processing the remaining input.

  4. How should a scanner behave if the last line is terminated by the EOF without a line terminator?

Like Java, VC requires that every line be terminated by a line terminator. Therefore, the program as specified is not a legal VC program. However, in all test cases we use for marking Assignment 1, each line is always terminated by a line terminator. Therefore, your scanner does not have to handle this situation in any special way. In other words, you can assume that all lines in a VC program are always terminated by a line terminator.

  5. For the test case below:

"\

should we return two errors (first an illegal escape character, then an unterminated string) or just one error?

It makes sense to at least print the error message:

ERROR: unterminated string

The string is indeed not terminated since the scanner is expecting to see a char after the "\". Of course, you may also want to print a second error message like:

ERROR: illegal escape character

Try a C or Java compiler to see what error messages will be produced.

For auto marking purposes, it is up to you whether your scanner prints one of the messages or both. All are acceptable.

  6. Does VC support nested comments in /* */?

No. See Section 4.2 of the VC Spec.
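In practice, "no nesting" means that a comment ends at the first "*/" the scanner meets, however many "/*" sequences appear inside it. The following is only a minimal sketch of that rule; the class and method names are hypothetical and are not part of the supplied VC framework.

```java
// Sketch only: CommentSkipper and skipBlockComment are hypothetical names,
// not part of the VC framework.
final class CommentSkipper {

    // src:   the whole input
    // start: index of the '/' that opens the comment
    // Returns the index just past the first closing star-slash,
    // or -1 if the comment is unterminated.
    static int skipBlockComment(String src, int start) {
        int i = start + 2; // step over the opening slash-star
        while (i + 1 < src.length()) {
            if (src.charAt(i) == '*' && src.charAt(i + 1) == '/') {
                return i + 2; // first terminator wins: no nesting depth is tracked
            }
            i++;
        }
        return -1; // reached the end of input without a terminator
    }

    public static void main(String[] args) {
        String src = "/* a /* b */ c */";
        int end = skipBlockComment(src, 0);
        // Whatever follows the first terminator is scanned as ordinary input.
        System.out.println(src.substring(end));
    }
}
```

With this rule, the text after the first terminator (here `c */`) is no longer part of the comment and is tokenised like any other input.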

  7. Are there any invisible characters that don't take up a column, e.g. the ASCII beep, that my columnIncrementer should skip?

It is up to you how to interpret them. Since a program with invisible characters is not syntactically legal, how you handle them will not have any effect on the subsequent assignments. On encountering an invisible character, your scanner will return an error token to the parser (according to the spec). Just make sure your counting mechanism works for programs free of invisible characters.

As for how to handle invisible characters in production compilers, it really depends on how the offending lines are displayed on the screen, together with the error message. For example, the "vi" editor tends to display all invisible control characters with a two-character sequence starting with a caret (^). For example, the ASCII beep is displayed as "^G". Then it may make sense to count the beep as taking two columns. If the following line is displayed as such:

^Gx = 1;

then it makes sense to say that x starts at column 3 rather than 2.
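If you wanted to count columns the way such an editor displays them, one option is to give each control character a display width of two. This is a rough sketch under that assumption only; the names are hypothetical, tab expansion is deliberately left out, and nothing here is required for the assignment.

```java
// Sketch only: CaretColumns and its methods are hypothetical names.
final class CaretColumns {

    // Width of c on screen when control characters are shown in caret
    // notation (for example ^G). Tabs are not expanded here; they are
    // simply counted as one column to keep the sketch short.
    static int displayWidth(char c) {
        boolean control = (c < 0x20 && c != '\t') || c == 0x7F;
        return control ? 2 : 1;
    }

    // 1-based display column at which the character at 'index' starts.
    static int displayColumn(String line, int index) {
        int col = 1;
        for (int i = 0; i < index; i++) {
            col += displayWidth(line.charAt(i));
        }
        return col;
    }

    public static void main(String[] args) {
        String line = "\007x = 1;"; // a beep (ASCII 7) followed by "x = 1;"
        System.out.println(displayColumn(line, 1)); // prints 3
    }
}
```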

In Java, the problem is solved easily without relying on using columns to mark the position of a particular token. Given the following input:

public class X {

public static void main(String[] args) {

int x^G = 1;

}

}

where there is a beep after the variable x, the Java compiler produces the following error message:

[ives 8:48am] ~/tmp % javac X.java

X.java:4: illegal character: \7

int x = 1;

^

1 error

  8. Are we meant to print an error message if an EOF marker is found before the end of line in a "//" style comment?

Section 4.1 in the VC specification requires that every line be terminated by the ASCII CR, LF or CR LF. Therefore, the proposed situation is not allowed.

  9. Should I always count a tab as 8 characters long?

According to the spec, "a tab is assumed to be 8 characters long." This does not mean that every tab that the scanner encounters occupies 8 blank spaces. In each of the following lines, there is a single tab between 1 and 9:

1       9

x1      9

xx1     9

xxxxxx1 9

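In each case, the tab advances to the next tab stop (columns 1, 9, 17, ...), so the 9 always ends up in column 9. Hopefully, this example makes it clear how tabs should be handled.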
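The arithmetic can be sketched as below, assuming 1-based columns and tab stops at columns 1, 9, 17, and so on. The class and method names are hypothetical; adapt the idea to however your scanner tracks positions.

```java
// Sketch only: TabColumns and its methods are hypothetical names,
// not part of the VC framework.
final class TabColumns {
    static final int TAB_WIDTH = 8; // "a tab is assumed to be 8 characters long"

    // Given the 1-based column occupied by a tab, returns the 1-based column
    // of the character that follows it (the next tab stop: 9, 17, 25, ...).
    static int columnAfterTab(int col) {
        return ((col - 1) / TAB_WIDTH + 1) * TAB_WIDTH + 1;
    }

    // Returns the 1-based column of the last character of 'line'
    // (0 for an empty line).
    static int columnOfLastChar(String line) {
        int col = 1;  // column of the character about to be read
        int last = 0; // column of the character most recently read
        for (char c : line.toCharArray()) {
            last = col;
            col = (c == '\t') ? columnAfterTab(col) : col + 1;
        }
        return last;
    }

    public static void main(String[] args) {
        String[] lines = {"1\t9", "x1\t9", "xx1\t9", "xxxxxx1\t9"};
        for (String line : lines) {
            System.out.println(columnOfLastChar(line)); // prints 9 for each line
        }
    }
}
```

Note that a tab sitting in column 9 moves the next character to column 17, so the number of columns a tab covers ranges from 1 to 8 depending on where it starts.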

  10. What lexemes should be returned for whitespace and comments?

Whitespace and comments are not tokens, and consequently, should just be discarded by the scanner.

  11. The supplied Java files did not compile for me.

You need to specify the path that java uses to look for (VC) classes. There are two ways of doing this. You can specify the path using the option "-classpath" every time you run java. Alternatively, you can set the environment variable CLASSPATH so that it includes the directory where VC has been installed. For example, I have installed VC under my home directory /home/jingling. I use tcsh. The command for setting CLASSPATH is:

setenv CLASSPATH .:/home/jingling

For bash users, the command becomes:

export CLASSPATH=.:/home/jingling

Of course, if CLASSPATH already exists and does not include the directory containing VC, the command will be:

setenv CLASSPATH ${CLASSPATH}:/home/jingling

For bash users, the command becomes:

export CLASSPATH=${CLASSPATH}:/home/jingling