Tirgul 5
How does Lexical Analysis Work?
Regular expressions define tokens, a finite automaton is built to accept those tokens,
read the input stream one character at a time, and return:
A. the longest acceptable string as the defined token,
B. in case of tie, pick the token generated by the first rule that accepts the string.
Proceed to the rest of the text and generate more tokens.
Why:
- to detect simple errors ,like 123.+5
- to remove whitespaces and other stuff that bother us.
- to abstract future processes.
Creating automata from regular expressions
We can view regular expressions as regular descriptors with production rules:
Digit → [0-9]
IntegerInString→ ["][1-9] {Digit}* | [0]["]
The states of the automata:
For each position of the "dot" a state in the final automaton is defined:
S1 = ·["] [1-9] {Digit}* | [0]["]
S2 = ["]· [1-9] {Digit}* | [0]["]
S3 = ["] [1-9] · {Digit}* | [0]["]
S4 = ["] [1-9] {·Digit}* | [0]["]
S5 = ["] [1-9] {Digit·}* | [0]["]
S6 = ["] [1-9] {Digit}*· | [0]["]
S7 = ["] [1-9] {Digit}* | ·[0]["]
S8 = ["] [1-9] {Digit}* | [0]·["]
S9 = ["] [1-9] {Digit}* | [0]["]·
The transitions of the automata:
Basic move:
· [1-9] → [1-9] · after reading [1-9] from the input
If the dot is before a basic pattern (not a regular operator) then the deterministic move of reading that pattern will move us to the next state.
Shift transitions:
- Epsilon transitions:
- Union operator
· (R1 | R2) →ε (·R1 | R2)
· (R1 | R2) →ε (R1 | ·R2)
(R1· | R2) →ε (R1 | R2)· - Kleene star operator
·(R1)* →ε (R1)*·
·(R1)* →ε (·R1)*
(R1·)* →ε (·R1)*
(R1·)* →ε (R1)*· - Other operators (such as R?, R+) you can define yourself.
- Deterministic transitions:
·R1 → R1·
Reduce transitions:
Assume R1, R2 are regular operators and
R1 → {something1}·R2 {something2}
we move to state R2→ {something3} until we reach the end of R2:
R2→ {something3}·
The reduce transition is an epsilon transition from that state to the state representing
R1 → {something1} R2 · {something2}
Building the none-deterministic finite automata:
- Every dot state (position) is a state in the automata
- If the dot is in front of basic pattern then SHIFT transition to the state where the dot is after the basic pattern.
- If the dot is in front or inside a regular operator then there are epsilon SHIFT moves as defined above.
- If the dot is before a complex pattern (contains regular operators and possibly basic patterns) then we perform epsilon move to a "mini" automata that accepts that pattern and a there is a REDUCE move from the accepting state of that automata to the state where the dot is after that pattern.
Small example:
building automata for the reg-exprs:
a(b)*
we create 5 states for each of these dot positions:
1a2(3b4)*5
state 5 is an accepting state.
Create the transitions, written as <state1, letter| ε , state2> :
<1 , a, 2>
<2, ε , 3>
<2, ε, 5>
<3, b, 4>
<4, ε, 5>
<4, ε, 3>
Bigger example:
Our reg-exprs are:
Foo → [0-9]* [x] [0-9]* [!]
Goo → [0-9]* [+] [0-9]* [!]
State 1 is the initial state and represents:
·[0-9]* [x] [0-9]*[!]
· [0-9]* [+] [0-9]*[!]
[0-9]*· [x] [0-9]*[!]
[0-9]*· [+] [0-9]*[!]
The transitions from this state are:
<1, [0-9], 1>
<1, x , 2>
<1, + , 3>
State 2:
[0-9]*[x]· [0-9]*[!]
[0-9]*[x][0-9]*·[!]
State 3:
[0-9]*[+]·[0-9]*[!]
[0-9]*[+] [0-9]*·[!]
The transitions:
<2, [0-9], 2>
<2, ! , 4>
<3, [0-9], 3>
<3, !, 5>
State 4:
[0-9]*[x][0-9]*[!]·
State 5:
[0-9]*[+] [0-9]*[!]·
Scheme Special Forms:
- let, letrec, let*
(letrec ((foo 5)
(goo 7))
(+ goo foo))
12
( (lambda (foo goo)
(set! foo 5)
(set! goo 7)
(+ goo foo) ) 'stam1 'stam2)
12
- begin – expressions evaluated one by one, the last is also returned.
- and, or.
- Lambda.
lambda expression is not evaluated, instead it creates a closure, which is a pair pointing to the current environment and to the code (the lambda body). Only upon application will the lambda body be evaluated (with an extension of the referenced environment).
;; > (define f (lambda (x) (+ x 1)))
;;
;; Global env (E)
;; +------+
;; | | <------+
;; | | |
;; | f ------> O=O
;; | | |
;; +------+ +-> parameters: x,
;; body: (+ x 1)
;;
;; > (f 2)
;;
;; Global env (E)
;; +------+
;; | | <------+
;; | | |
;; | f ------> O=O
;; | | |
;; +------+ +-> parameters: x,
;; ^ body: (+ x 1)
;; |
;; |
;; +------+
;; | x: 2 |
;; | |
;; +------+
;; (+ x 1)
;; 3
- Quote & Backquote (also called quasiquote).
’x = (quote x)
`x = (quasiquote x)
,x = (unquote x)
,@x = (unquote-splicing x)