CmSc315 Programming Languages
Homework 03, due 02/05 by midnight
The first task of the interpreter is to split each statement into tokens. A token is a string of symbols with a defined meaning in the language. For example, labels and variables, numbers, keywords, and each operator symbol in the language would be tokens.
The purpose of this homework is to develop a function called tokenizer that does the following:
- Split each statement into tokens and store the tokens into an array.
- Report errors. If the statement contains a symbol that is not in the alphabet of the language, an error should be reported. The error message should contain the unrecognized symbol, the number of the statement, and the statement where the error was found, e.g.
unrecognized symbol % in statement 5: L a = a % b
In case of an error, the statement is not processed further; however, the statements that follow should still be processed. Thus, in one run the user gets all errors found by the tokenizer.
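As a minimal sketch of the message format in the example above, the error line could be built by a small helper. The name format_error and its parameter list are assumptions made for illustration, and statements are assumed to be numbered from 1.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not required by the assignment): builds the error
// line in the format shown in the example above.
// statement_number is assumed to be 1-based.
std::string format_error(char symbol, int statement_number,
                         const std::string& statement) {
    assert(statement_number >= 1);
    return std::string("unrecognized symbol ") + symbol +
           " in statement " + std::to_string(statement_number) +
           ": " + statement;
}
```

Keeping the formatting in one place makes it easy to print the same message for every statement that contains an error.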
Function specification:
Input:
An array of type string containing the source program in the simple language. Each element in the array corresponds to a line in the program.
SIZE - the number of statements in the input program.
Output:
A table of type string where each row corresponds to a statement. Within a row the elements of the table contain the tokens of the corresponding statement.
An array of type int and of length SIZE, containing the number of the tokens in each statement.
Boolean variable indicating whether there was an error or not.
Here is an example of the results of the tokenizer:
Input file:
read a
read b
L1 if b = 0 goto L2
a = a+1
b = b-1
goto L1
L2 print a
halt
Number of tokens / Table of tokens
2 / read / a
2 / read / b
7 / L1 / if / b / = / 0 / goto / L2
5 / a / = / a / + / 1
5 / b / = / b / - / 1
2 / goto / L1
3 / L2 / print / a
1 / halt
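One possible way to print a row of this table is sketched below. The helper name format_row and the " / " separator are assumptions chosen to mirror the layout above; any readable layout would do.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats one result row as the token count
// followed by the tokens of that statement, separated by " / ".
std::string format_row(const std::string tokens[], int count) {
    assert(count >= 0);
    std::string row = std::to_string(count);
    for (int i = 0; i < count; ++i) {
        row += " / " + tokens[i];
    }
    return row;
}
```

The main program could call such a helper once per statement, passing the corresponding row of the token table and its token count.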
What to turn in:
- If C++ is used, turn in the source code of the program (main plus the function tokenizer). If Java is used, turn in the zipped project. The main program should read the source program from a file and store it line by line in an array of type string. This array will be the input to the tokenizer. After the tokenizer has split the program into tokens, the main program should print the results: the token table and the array that shows how many tokens there are in each line.
- Read_me file with instructions on how to run your program.
- Sample file with correct statements.
- Sample file with errors.
Here are some hints on how to implement the tokenizer, oriented towards C++ programmers.
Data structures
To perform this, your main program needs the following basic data structures, which are created or used by the tokenizer:
string statements[] / Array that contains the source program
int st_num / Number of statements in the source program
string token_items[][] / A two-dimensional array of strings. Each row corresponds to a statement and contains the tokens in that statement. You have to specify the dimensions.
int tok_number[] / this array gives the number of tokens in each statement:
tok_number[k] is the number of tokens in the k-th row of token_items[][]
bool tok_error / Return value of the tokenizer: true if an error was found in any statement, false otherwise
Given the above data structures, the function tokenizer should have the following prototype:
bool tokenizer(string[], int, string[][], int[]);
A call to the tokenizer would be:
tok_error = tokenizer(statements, st_num, token_items, tok_number);
where:
statements[] is the string array with the source program. Input parameter.
st_num is the number of statements. Input parameter.
token_items[][] is the table to be filled with tokens by the function. Output parameter.
tok_number[] is the array that shows the number of tokens in each line. Output parameter.
Note that all output parameters are passed by reference (in C++, array arguments behave this way automatically, so the function modifies the caller's arrays).
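Since C++ requires every array dimension except the first to be known in a parameter declaration, a compilable form of the prototype needs a concrete column size. The sketch below shows one way to write it. The constants MAX_STATEMENTS and MAX_TOKENS and the helper clear_tables are assumptions made for illustration and are not part of the assignment.

```cpp
#include <cassert>
#include <string>
using std::string;

const int MAX_STATEMENTS = 100;  // hypothetical limit on program length
const int MAX_TOKENS = 50;       // hypothetical limit on tokens per statement

// The prototype from the text with the column dimension spelled out.
bool tokenizer(string statements[], int st_num,
               string token_items[][MAX_TOKENS], int tok_number[]);

// Hypothetical helper: resets the output tables before a run. It also
// illustrates that array parameters refer to the caller's storage, so
// changes made here are visible after the call returns.
void clear_tables(string token_items[][MAX_TOKENS], int tok_number[],
                  int st_num) {
    assert(st_num >= 0 && st_num <= MAX_STATEMENTS);
    for (int k = 0; k < st_num; ++k) {
        tok_number[k] = 0;
        for (int j = 0; j < MAX_TOKENS; ++j) {
            token_items[k][j].clear();
        }
    }
}
```

In main, the tables would then be declared as string token_items[MAX_STATEMENTS][MAX_TOKENS] and int tok_number[MAX_STATEMENTS].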
Here is the outline of the algorithm in the main program:
- Set number of statements equal to 0
- Read the source program from a file into a string array and determine the number of statements.
- Call the function tokenizer.
- Print the results
- If there was an error, print the message "Translation terminated".
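The first two steps of this outline could be sketched as follows. The function name read_program, its parameters, and the convention of returning -1 when the file cannot be opened are assumptions made for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: reads the file line by line into statements[],
// storing at most max_statements lines.
// Returns the number of statements read, or -1 if the file cannot be opened.
int read_program(const std::string& filename, std::string statements[],
                 int max_statements) {
    assert(max_statements > 0);
    std::ifstream in(filename);
    if (!in) {
        return -1;
    }
    int st_num = 0;  // set number of statements equal to 0
    std::string line;
    while (st_num < max_statements && std::getline(in, line)) {
        statements[st_num++] = line;  // one array element per source line
    }
    return st_num;
}
```

The returned count is the value that main would pass to the tokenizer as st_num. Note that this sketch stores blank lines as statements too; they would simply yield zero tokens.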
Here is the outline of the algorithm of the tokenizer:
For each source statement:
    Copy the statement into a temporary string temp, and set the number of tokens for the current statement to zero.
    While temp is not empty do:
        Call a function segmentor that returns the token at the beginning of temp and the remaining string in temp.
        If there is an error, print an appropriate message and proceed with the next statement.
        If there is no error, store the token in the token_items table and increment the number of tokens in the statement.
Here is the outline of the algorithm of the segmentor. It separates one token from temp into a variable token:
Set error flag to false
Remove all leading spaces
Store the first character into token, and remove it from temp
If the first character is a digit:
    Append to token all subsequent characters that are digits, and remove them from temp.
If the first character is a letter:
    Append to token all subsequent characters that are digits or letters, and remove them from temp.
If the first character is a special symbol (e.g. an operator), do nothing.
If the first character is none of the above, set error flag to true
Return error flag, temp and token
Note: temp and token are passed by reference.
You may use the library functions isdigit(char) and isalpha(char), declared in <cctype>, when checking for valid symbols.
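The tokenizer and segmentor outlines above can be sketched together as shown below. This is one possible reading of the outlines, not a reference solution. The constant MAX_TOKENS, the set of special symbols in SPECIAL_SYMBOLS, the 1-based statement numbering in the message, and the handling of trailing blanks (an empty token with no error) are all assumptions; substitute the actual alphabet of the language.

```cpp
#include <cassert>
#include <cctype>
#include <iostream>
#include <string>
using std::string;

const int MAX_TOKENS = 50;                 // hypothetical limit on tokens per statement
const string SPECIAL_SYMBOLS = "+-*/=<>";  // assumed operator set, adjust to the language

// Separates one token from the front of temp. Returns true on error,
// in which case token holds the unrecognized symbol.
bool segmentor(string& temp, string& token) {
    token.clear();
    string::size_type i = 0;
    while (i < temp.size() && std::isspace(static_cast<unsigned char>(temp[i]))) ++i;
    temp.erase(0, i);                      // remove all leading blanks
    if (temp.empty()) return false;        // only blanks were left: no token, no error

    char first = temp[0];
    unsigned char uc = static_cast<unsigned char>(first);
    string::size_type len = 1;
    if (std::isdigit(uc)) {                // number: digits only
        while (len < temp.size() && std::isdigit(static_cast<unsigned char>(temp[len]))) ++len;
    } else if (std::isalpha(uc)) {         // name or keyword: letters and digits
        while (len < temp.size() && std::isalnum(static_cast<unsigned char>(temp[len]))) ++len;
    }
    token = temp.substr(0, len);
    temp.erase(0, len);
    bool is_special = SPECIAL_SYMBOLS.find(first) != string::npos;
    return !(std::isdigit(uc) || std::isalpha(uc) || is_special);
}

// Fills token_items and tok_number; returns true if any statement had an error.
bool tokenizer(string statements[], int st_num,
               string token_items[][MAX_TOKENS], int tok_number[]) {
    bool tok_error = false;
    for (int k = 0; k < st_num; ++k) {
        string temp = statements[k];
        tok_number[k] = 0;
        while (!temp.empty()) {
            string token;
            if (segmentor(temp, token)) {
                std::cout << "unrecognized symbol " << token << " in statement "
                          << (k + 1) << ": " << statements[k] << '\n';
                tok_error = true;
                break;                     // skip the rest of this statement
            }
            if (token.empty()) break;      // trailing blanks only
            assert(tok_number[k] < MAX_TOKENS);
            token_items[k][tok_number[k]++] = token;
        }
    }
    return tok_error;
}
```

In this sketch, a statement with an error keeps the tokens that were stored before the unrecognized symbol was met, and the loop then continues with the next statement so that all errors are reported in one run.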