Tokennizer

We are going to analyze and implement a program to tokenize a string expression. The lecture covered the steps of analysis such as I/O investigation, pseudo-code of various parts. Here is the implementation strategy.

Phase 1:Code for a simple case: integer token

//handle the case of token number.

Input : “125+ x” Expected output: (“125” , number)

Phase 2. Create a framework for a general case as follows

//Read the string expression: StrExp

//Extract into variable firstChar, the first character of StrExp

repeat //until no more char in StrExp

//Inspect firstChar

case firstChar

‘0’..’9’ or ‘.’: // Phase 1

// Get the token and print it out

// Reset token to be null

blank : while ( firstChar is blank)

{ Reduce StrExp by its first character

Extract into into firstChar the first character of StrExp}

‘x’: print x and its type

Reduce StrExp by its first character

Extract into into firstChar the first character of StrExp

LeftParenthesis: print ( and its type

Reduce StrExp by its first character

Extract into into firstChar the first character of StrExp

RightParenthesis: print ) and its type

Reduce StrExp by its first character

Extract into into firstChar the first character of StrExp

…………………………….

endcase

until there is no more char in strExp

Testing:

Input : “ 125.2 + x” Expected output: (“125.2” , number) (“+”, operator) (“x”, Math variable)

Input : “ 125.2 + 45 ” Expected output: (“125.2” , number) (“+”, operator) (“45”, number)

Phase 3: Error Detection and Hidden Tokens.

  • Try to detect the input error, extraction error and then fix them. The error can be caused by the extraction of a null string.
  • Expand the current source code to cover the case of

125x or x125.

  • We have discussed in class that when we find a new token, the token will be print out with its type. Let’s save its type into a variable tokenType. This variable is initialized to be –1 as no token. Before we set a tokenType be be a new value, it holds OF COURSE the token type of the LAST token. We can this variable to see if we have 125x. In this case we inset a new token “*” and its type, then we will print the current token and set the tokenType to be the current type.

Phase 4: Save the output into a linked list

// Build a class of TokenNode of 3 fields: tokenName, tokenType, and tokenNext

class TokenNode

{

String tokenName

inttokenType

TokenNodetokenNext

}

// Declare a variable to hold the linked list of tokens and initialize it to be null.

TokenNodetop  null;

// At any println in your source code, add the following code to print (link) to the variable top

// Create a new node for a new discovered token.

temp  new TokenNode();

temp.tokenName  printed token

temp.tokenType  printed type

temp.tokenNext  null

// Link it to the variable top

temp.tokenNext  top

top  temp

// After all tokens are extracted (in this case StrExp is empty), we print the linked list

// print out the linked list for testing

temp top

while (temp is not null) do

{ print (temp.tokenName);

println(temp.tokenType);

temp temp.tokenNext;

}

Note: You would not be surprised that the token stream is printed in a reversed order. Look at the code and modify it so that the var. top will hold the tokens in a proper order.

Answer:

In this case you need a new variable bottom that references the last node. You must insert the codes below appropriately to get the desired result.

// first time top and bottom are both null

top = null; bottom = null

// for the first node temp, top and bottom point to the same node

top = temp;

bottom = temp;

// second node and after, top is unchanged, but bottom is changed.

bottom.tokenNext = temp;

bottom = temp;

Phase 4 alternate: Save the output into an array.

Option 1:

  • It is much easier to save the tokens and their types into two different arrays:

String tokenName [] and initialize it to be null in all indices;

int tokenType[] and initialize it to be –1 for all indices.

  • Each time you print out the token and its type, you need to save them into these arrays of the same index.

tokenName[index]  token;

tokenType[index]  the value you printed out.

Option 2:

  • It is a little bit harder. You will save a token and its type into a 1-dimentional array of TokenNode.
  • Each TokenNode is a pair of information (token, type). In this case, you must create a class of tokenNode which is a set of pairs (token, type).

Class TokenNode

{

String tokenName;

int tokenType;

}

// declaration of the array

TokenNode tokenStream[];

You must implement it and write a paragraph to compare pros and cons between the two structures: linked list and array in this particular program Tokenizer.

Grading plan:

1)Documentation 2 points

2)Implementation strategy: 2 points

3)Code/test: 6 points

4)Bonus to do Phase 4/alt and comparison: 3 points.

Note about implementation:

  • During the course of implementation you might get a lot of error messages regarding about extracting and reducing a string.
  • In Java for example, you can not reduce string if it is a single-character string by using the built-in method: strExp = strExp.substring(1); This makes sense when you think strExp has one character at the index 0 and there is no index 1.
  • To overcome this difficulty, you must use if-statement to reduce the string to null:

if ( strExp.length == 1) {strExp = null;}

else { strExp = strExp.substring(1);}

spham (Tokenizer)Page 110/08/2018