Lex
This section will demonstrate how to write a simple lexical analyzer to identify integers, floats, keywords, and variables. It will also cover some basic techniques in error checking and other areas that may not be intuitive.
The first section includes the string library and defines a few variables that we will use later.
%{
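#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi, atof */
#include <string.h>
char buf[500];       /* buffer for the current line */
int cur_tok;         /* offset in buf where the next token is appended */
int token_id;
char token_text[64];
int n=1;             /* current line number */
int end_reached=0;   /* set by yywrap() at end of input */
/* helper functions defined after the rules section */
int clear_buf(void);
int addto_buf(void);
int addchar_to_buf(int c);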
%}
The next section defines a few named patterns that will be used in the rules section below.
- DIGIT matches any single character ‘0’ thru ‘9’.
- ID matches any character ‘a’ thru ‘z’ followed by any characters ‘a’ thru ‘z’, ‘A’ thru ‘Z’, ‘0’ thru ‘9’, or an underscore. The {0,16} at the end limits the number of following characters to 16, so an ID is at most 17 characters long.
- KEY matches one or more characters ‘A’ thru ‘Z’.
DIGIT   [0-9]
ID      [a-z][a-zA-Z0-9_]{0,16}
KEY     [A-Z]+
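To check what a definition such as ID accepts before wiring it into a scanner, the same pattern can be tried against sample strings in plain C. The sketch below assumes a POSIX system (for <regex.h>) and writes the ID pattern with a plain underscore in the character class. The names matches_pattern and ID_PATTERN are hypothetical and are not part of lex.

```c
#include <regex.h>
#include <stddef.h>

/* The ID definition, anchored so the entire string must match. */
const char *ID_PATTERN = "^[a-z][a-zA-Z0-9_]{0,16}$";

/* Hypothetical helper: returns 1 if the whole string s matches the POSIX
   extended regular expression in pattern, 0 otherwise (or on a bad pattern). */
int matches_pattern(const char *pattern, const char *s)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Calling matches_pattern(ID_PATTERN, "count_1") succeeds, while "Count" fails because of the leading capital. In the scanner itself an over-long name is not rejected: lex matches the first 17 characters as one ID and starts a new token after them.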
The rules section defines each token to be recognized and executes the C code contained in the curly brackets whenever that token is found. In this case the code simply prints out which token it found and, for some tokens, returns to the caller. The “|” characters act as “ors”, so a single rule can recognize several alternatives. For instance, the ASCII token will be triggered if any one of the listed symbols is found.
%%
{DIGIT}+    { //found a group of digits - must be an integer
                addto_buf();
                printf("TOKEN - Integer(%d)\n", atoi(yytext));
                return 1; //yylex() returns int; nonzero means a token was found
            }
{DIGIT}+"."{DIGIT}+ { //found a float: digits, a dot, then more digits
                addto_buf();
                printf("TOKEN - Float (%g)\n", atof(yytext));
                return 1;
            }
"("|")"|"+"|"*" { //found a single ASCII character
                printf("TOKEN - ASCII character ('%s')\n", yytext);
                addto_buf();
            }
"=="        { //found a multi-character ASCII token
                printf("TOKEN - Multi-ASCII character ('%s')\n", yytext);
                addto_buf();
            }
END|BEGIN   { //found a keyword
                printf("TOKEN - Reserved (%s)\n", yytext);
                addto_buf();
            }
{KEY}       { //an uppercase word not recognized above is not a valid keyword
                printf("**ERROR - Reserved word is not recognized (%s)\n", yytext);
                addto_buf();
            }
{ID}        { //found a variable
                printf("TOKEN - Variable (%s)\n", yytext);
                addto_buf();
                return 1;
            }
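When two rules match the same text, lex picks the longest match and, for ties, the rule listed first. That is why END and BEGIN are reported as reserved words even though the KEY pattern also matches them. A rough plain-C sketch of that tie-break for uppercase words follows. The enum and function names are hypothetical and only imitate the rule order above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum word_kind { WORD_RESERVED, WORD_BAD_KEYWORD, WORD_OTHER };

/* Classify a word the way the END|BEGIN rule followed by the KEY rule would:
   the reserved words are checked first, so they win the tie. */
enum word_kind classify_upper(const char *word)
{
    size_t i;

    if (strcmp(word, "END") == 0 || strcmp(word, "BEGIN") == 0)
        return WORD_RESERVED;

    if (word[0] == '\0')
        return WORD_OTHER;
    for (i = 0; word[i] != '\0'; i++) {
        if (!isupper((unsigned char)word[i]))
            return WORD_OTHER;   /* not an all-uppercase word */
    }
    return WORD_BAD_KEYWORD;     /* uppercase, but not a known keyword */
}
```

If the two rules were listed in the opposite order, the KEY rule would win every tie and the reserved-word rule would never fire.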
This rule absorbs whitespace without reporting a token.
[ \t]+      { //ignore whitespace, but still put it in the buffer
                addto_buf();
            }
In a typical parsing application, it would not be unusual to want to print out each line as it is read. Because of the way lex buffers the input, this turns out to be a more complex task than expected. One way to accomplish it is to buffer everything as it is read (using the addto_buf function seen in each token rule) and print the buffer whenever an end-of-line (\n) is encountered. The downside to this approach is that each line is printed after its tokens have been reported.
\n          { //if end of line is reached, print the buffer
                //there is no good way to simply print each line - need to store everything as it is read
                //then dump the buffer after each end-of-line
                addto_buf();
                printf("\nLine %d: %s\n", n, buf);
                n++;
                clear_buf();
            }
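The buffering idea can be tried outside of lex. The sketch below is a stand-alone approximation rather than the helper functions used in this file: line_append, line_flush and LINE_BUF_SIZE are hypothetical names, and the append is bounded so that an over-long line is truncated instead of overflowing the buffer.

```c
#include <stdio.h>
#include <string.h>

#define LINE_BUF_SIZE 500

static char line_buf[LINE_BUF_SIZE];  /* text of the pending line */
static int line_no = 1;               /* number of the pending line */

/* Append text to the pending line, truncating if the buffer is full. */
void line_append(const char *text)
{
    size_t used = strlen(line_buf);
    strncat(line_buf, text, sizeof(line_buf) - used - 1);
}

/* Print the pending line with its number and start a new, empty line.
   Returns the number of the line that was printed. */
int line_flush(void)
{
    int printed = line_no;
    printf("\nLine %d: %s\n", line_no, line_buf);
    line_no++;
    line_buf[0] = '\0';
    return printed;
}
```

A scanner would call line_append for every piece of matched text and line_flush when it sees a newline, which is why the line appears only after its tokens have been reported.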
Another common need is identifying areas of code covering multiple lines that need to be ignored. This can be accomplished by recognizing the beginning of such an area (in this case “/*”), and walking thru characters using the function input() until the end is reached. The following code does this for standard C comments and also adds some error checking in case the end of file is reached in the middle of a comment.
"/*"        { //after finding the beginning of a comment, walk thru the text until the end of the comment
                int c;
                addto_buf();
                for ( ; ; ) {
                    //skip the body of the comment up to the next '*'
                    while ((c = input()) != '*' && c != EOF) {
                        addchar_to_buf(c);
                    }
                    if (c == '*') {
                        addchar_to_buf(c);
                        //skip a run of '*' characters
                        while ((c = input()) == '*') {
                            addchar_to_buf(c);
                        }
                        if (c != EOF) addchar_to_buf(c); //do not store EOF in the buffer
                        if (c == '/') break; //found end of comment
                    }
                    if (c == EOF) {
                        printf("**ERROR - EOF in comment");
                        printf("\nLine %d: %s\n", n, buf);
                        n++;
                        break;
                    }
                }
            }
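The same scan can be expressed over an in-memory string, which makes the end-of-input case easy to test. This is a sketch with a hypothetical function name, using a one-character lookahead in place of input(). It is not how lex itself reads characters.

```c
#include <stddef.h>

/* Scan text starting at index start, which should point just past the
   opening slash-star of a comment. Returns the index just past the closing
   star-slash, or -1 if the text ends before the comment is closed. */
long skip_comment(const char *text, size_t start)
{
    size_t i = start;

    while (text[i] != '\0') {
        /* a '*' followed by '/' ends the comment; runs of '*' are handled
           because each '*' is checked against the character after it */
        if (text[i] == '*' && text[i + 1] == '/')
            return (long)(i + 2);
        i++;
    }
    return -1;   /* unterminated comment */
}
```

A return value of -1 corresponds to the “EOF in comment” error reported by the rule above.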
This last rule is a catch-all. It looks for any single character (denoted by ‘.’) that has not been identified above and prints out an error message.
.           { //look for any single character, if not previously recognized...
                printf("**ERROR - Unrecognized ASCII character (%s)\n", yytext);
                addto_buf();
            }
%%
//these functions provide the line buffer and print functionality
int clear_buf(void)
{
    buf[0] = '\0';
    cur_tok = 0;
    return 0;
}
int addchar_to_buf(int c)
{
    cur_tok = strlen(buf);
    if (cur_tok < (int)sizeof(buf) - 1) { //leave room for the terminator
        buf[cur_tok] = (char)c;
        buf[cur_tok + 1] = '\0';
    }
    return 0;
}
Notice that the variable “yytext” contains the character string of the token that was found, and note that it gets reused for every token found. While this is not significant if the objective is simply to print out a token as in this example, it matters if you need to keep a copy. In that case you must do a “strcpy(newchar, yytext)” or similar into storage of your own.
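One way to keep a token after the scanner has moved on is to copy it into freshly allocated memory. The helper below is a minimal sketch with a hypothetical name (save_token). It takes the token text as a parameter so that it can be tried without lex, and in a scanner it would be called as save_token(yytext).

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of a token's text, or NULL if the
   allocation fails. The caller is responsible for calling free(). */
char *save_token(const char *text)
{
    size_t len = strlen(text);
    char *copy = malloc(len + 1);

    if (copy != NULL)
        memcpy(copy, text, len + 1);   /* includes the terminating '\0' */
    return copy;
}
```

The copy stays valid when the original buffer is overwritten by the next token, at the cost of one allocation per saved token.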
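int addto_buf(void)
{
    cur_tok = strlen(buf);
    strncat(buf, yytext, sizeof(buf) - cur_tok - 1); /* append to buffer, truncating if full */
    return 0;
}
int yywrap(void)
{
    end_reached = 1;
    return 1; /* tell lex there are no more input files */
}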
The main function keeps calling yylex() until end of file has been reached. yylex() is the function produced by lex from the definitions and rules listed above. In most situations this would be called from within the yacc-generated file.
int main(void)
{
    yyin = stdin;
    while (end_reached != 1) {
        yylex();
    }
    return 0;
}
This file (lex_1.l) can be found in the Part 1 folder and can be converted to C code, compiled, and tested by typing the following:
lex lex_1.l
cc lex.yy.c -o parse
./parse < test1.txt