COM2010: Functional Programming
Lecture Notes, 2nd part
P Green, M Gheorghe, M Mendler
17. Recognisers and Translators
Contents
17.1 Finite State Machine (FSM)
17.2 Translator
17.3 Parser
There are some specific mechanisms for recognising words or sentences (regular expressions, context-free grammars) or for translating them into other things. We shall present some of these mechanisms and how they may be codified in Haskell.
For a given set S we may define sequences of symbols over S. More precisely, for
S={s_1, …, s_n}
x=x_1…x_p is a sequence of symbols over S if x_k is from S for any k=1..p. We denote by Seq(S) the set of all sequences over S.
For example if Letters is the Latin alphabet, {‘a’..’z’, ‘A’..’Z’}, then the following sequences are sequences of symbols over Letters (belong to Seq(Letters))
John home word
whereas
word1 long_sentence
aren’t (don’t belong to Seq(Letters)).
In general not all sequences are of a particular interest and in the set Seq(S) is identified a (proper) subset called language which has some specific properties. We shall address those languages defined by some syntactical rules. When S is an alphabet then a specific language is a vocabulary associated to S and some rules defining words. A given vocabulary V, may be considered as a set S instead, and in this case a specific language may be considered as being the set of sentences over V constructed according to some rules.
17.1 Finite State Machine (FSM)
A FSM has a heterogeneous structure containing states, labels, and transitions. Since now on we have to deal with only deterministic FSMs, called simply FSMs. By using the polymorphic
type SetOf a =[a]
we may define a FSM thus:
data Automaton =
FSM (SetOf State)
(SetOf Label)
(SetOf Transition)
InitialState
(SetOf State) – set of final states
where
type State = Int
type Label = Char
data Transition = Move State Label State
type InitialState = State
Note. Automaton is an algebraic data type.
Example
where 0 is the initial state and 3, 7 are final states. This is defined in Haskell as
automatonEx = FSM [0..7] ['a','b','c']
[Move 0 'a' 1, Move 1 'b' 2,
Move 2 'c' 1, Move 2 'a' 3,
Move 3 'b' 3, Move 0 'b' 4,
Move 4 'a' 5, Move 5 'c' 7,
Move 4 'c' 6, Move 6 'a' 7]
0 [3,7]
In order to match a string against a FSM it’s required to start from the initial state and then find a path leading to a final state. For example “abcbab” is recognised by the above FSM as we may start from 0 by recognising ‘a’ then go to 1 where ‘b’ is recognised and so on until arriving in 3 where the last ‘b’ is recognised and the next state, where the path stops, is still 3 which is a final state.
Various components of a FSM are obtained by using some select functions:
tr :: Automaton -> SetOf Transition
-- all transitions of an automaton
tr (FSM _ _ t _ _) = t
and for transitions
inState :: Transition -> State
-- input transition state
inState (Move s _ _) = s
outState :: Transition -> State
-- output transition state
outState (Move _ _ s) = s
label :: Transition -> Label
-- transition label
label (Move _ x _ ) = x
With the function below we may get all the transitions emerging from a state s and labelled with the same given symbol x:
oneMove :: Automaton -> State -> Label -> SetOf Transition
oneMove a s x = [t| t <- tr a, inState t == s, label t == x]
where a list comprehension is used with some conditions imposed to transitions t of the FSM a.
A recogniser that matches an input string against a FSM starting from a state s, is recursively defined thus
recogniser :: Automaton -> State -> String -> State
recogniser a s xs
-- 0 or > 1 transition; ret. a dummy state
| length ts /= 1 = -1
-- no further inputs; returns next state
| tail_xs == [] = os
-- still inputs to be processed
| otherwise = recogniser a os tail_xs
where ts = oneMove a s (head xs);
tail_xs = tail xs;
os = outState (head ts)
The next function shows how a string is recognised by a FSM following a path from the initial state to a final state
acceptor :: Automaton -> String -> Bool
acceptor a xs = isFinal a (recogniser a (inS a) xs)
where
isFinal :: Automaton -> State -> Bool
-- check whether or not a state is final
isFinal a s = s `elem` fs a
fs :: Automaton -> FinalStates
-- all final states
fs (FSM _ _ _ _ f) = f
inS :: Automaton -> InitialState
-- initial state
inS (FSM _ _ _ s _) = s
If we consider the automaton defined above we get
acceptor automatonEx "abcbab" Þ True
which says that automatonEx recognises the input string “abcbab” by traversing a path starting from the initial state and stopping in a final state.
17.2 Translator
Any recogniser may be transformed into a translator by adding some mechanisms for getting out symbols. The output symbols may be associated with inputs such as for any input symbol recognised a suitable output symbol is sent out. For the automaton defined in 17.1 we may associate a translator as follows
Consequently for the input “abcbab” which is accepted by this automaton a corresponding output is produced, namely “xyzyxy”.
The automaton with outputs may be defined by extending the definition of a FSM with suitable output symbols.
data AutomatonO =
FSMO(SetOf State)
(SetOf InputLabel)
(SetOf OutputLabel)
(SetOf Transition)
InitialState
(SetOf State) – set of final states
where
type InputLabel = Char
type OutputLabel = Char
data Transition = Move State InputLabel OutputLabel State
A translator may be thus defined
translator ::
AutomatonO ->(State,OutString) ->InString -> (State,OutString)
where InString and OutString are defined as String and denote the input and output strings, respectively.
In this case any of the equations defining translator contains tuples instead of states. The tuples are of the form (state,outSymbols), where outSymbols is a string collecting the output label of the current transition.
Exercise. Define translator and the associated select functions.
Another type of translator is defined by aggregating some inputs and sending them out in certain states. These translators are largely used to recognise lexical units or tokens of programming languages and are called in this case lexical analysers.
The next example shows a FSM which is able to iterates through a sequence of characters and identify in the final states 1 the identifiers (sequences of letters and digits with the first symbol being a letter) and in 2 the integer numbers (sequences of digits). These are the lexical units and are delimited by a space character ‘ ‘.
letter is any of ‘a’..’z’ or ‘A’..’Z’ and digit is any of ‘0’..’9’; letterDigit is either letter or digit.
For a string like “ident 453 Id7t” the above automaton may translate it into the following lexical units ident and Id7t which are recognised in the final state 1 and 453 recognised in state 2.
When a comment is recognised, a sequence starting with ’{-‘, ending with ‘-}’ and containing any characters in between, in final state 6, then it is discarded. For example the string “34 {-comment-}” produces only one token, 34
Important! In order to ease the process of recognising lexical units assume the tokens are always separated by spaces (‘ ‘) and consequently from every final state we should have a transition to the initial state labelled by ‘ ‘
The following definition is an extension of that given for a FSM which defines an extended automaton
data ExtAutomaton =
EFSM (SetOf State)
(SetOf Label)
(SetOf Transition)
InitialState
(SetOf State)
(SetOf FinalStateType) —new!!
where
type FinalStateType = (State, TokenUnit)
type TokenUnit = (Int,String)
It follows that the last line in the definition of ExtAutomaton contains a list of tuples (state, tokenUnit), with state being a final state where a lexical unit is recognised and sent out in tokenUnit. Every tokenUnit is a tuple where the first component is a code (an integer value used by the parser) and the last part is the lexical unit itself.
In our example only states 1 and 2 occur in the list of FinalStateType. The final state 6 is not in this list, and it follows that the tokens recognised in this state are discarded (these units correspond to comments).
The translator, which is called lexical analyser, will use a translation function defined thus
translation ::
ExtAutomaton -> (State,SetOf TokenUnit) -> InputSequence -> String-> (State, SetOf TokenUnit)
translation takes
· an extended automaton
· a tuple with the first component a state – in general the initial state – and the second part a list of token units – in general empty –
· an input sequence of characters
· a string where the current lexical unit will be collected; initially it is empty
and produces the last state where the translation process stops and the sequence of token units recognised.
Example. Let us assume that for identifier, identCode(= 1) and for number, noCode(= 2) are defined. If extAutomaton is the extended automaton corresponding to the last figure and the input is
"ident {-comment-} 346 lastIdent" then
translate extAutomaton (0,[])
"ident {-comment-} 346 lastIdent"[] Þ (1,[(1,"ident"),(2,"346"),(1,"lastIdent")])
So the translation stops in state 1, which is a final state where lastIdent has been recognised and produces the following token units:
(1,"ident") (2,"346") (1,"lastIdent")
translation function is defined by the following algorithm:
· when the input string is empty it stops by producing the current state and the list of token units
· otherwise (input string is not empty)
o if the character in the top of the input string is not ‘ ‘ then it is added to the string collecting the current lexical unit and translation resumes from the next state, current string collected, and the rest of the input string
o (current character is ‘ ‘) the previous state is in the list of FinalStateType then a token unit is recognised and added to the list of token units and translation resumes from the next state, with an empty string where next lexical unit will be collected, and the rest of the input string
o otherwise (the previous state is not in that list) the collected token is discarded and translation resumes from the next state, with an empty string where next lexical will be collected, and the rest of the input string
17.3 Parser
Parsing a program means passing through the text of the program and checking whether the rules defining the syntax of the programming language are correctly applied. In fact parsing comes immediately after lexical analysis and consequently processes a sequence of token units rather than the initial sequence of characters defining the program.
The syntax rules may be given in various forms: context-free rules, EBNF notation or syntax diagrams. All these notations are equivalent but the last two provide more conciseness than the former.
Let us consider a very rudimentary imperative programming language, called SA (Sequence of Assignments), consisting only of assignment statements delimited by ‘;’. Each assignment has also a very simple form (identifier = number or identifier = identifier).
We also assume that every program should end with a specific lexical unit called ‘eop’ (lexical analyser will be responsible for adding this bit).
We may define the syntax of SA with the following set of syntax diagrams:
1. Program::=
2. StmtList::=
3. Assign::=
4. RestAss::=
5. Exp::=
6. Trm::=
7. Operator::=
8. Eop::= eop
9. Delim::= ;
10. Identifier::= ident
11. AssSymb::= =
12. Number :: =no
13. AddOp::= +
14. MinOp::= -
15. LHandS::= ident
Observations
· three main diagrams may be distinguished: sequence (1,3,4), alternative (5) and iteration (2)
· any of these diagrams has two (non-terminal) symbols
· the last diagrams ( 6 to 11) are sequence diagrams but with only one (terminal) symbol, corresponding to main lexical units (in this case ident, no, ;, =, eop)
· a simpler specification may be obtained (try and find it!) but this is a kind of “normal form” which will ease writing the parsing functions.
The following more general case, with four diagram types could be addressed:
Sequence::=
Alternation::=
Iteration::=
Term::=
In order to be able to write a deterministic parser (without backtracking) the corresponding equivalent grammar should be LL(1), which means that the diagrams for alternation and iteration should possess the following properties:
· (alternation) X and Y should derive disjoint sets of terminals on the first position – for SA (diagram 5), ExpId derives {Ident} and ExpNo derives {No}
· (iteration) Y and the non-terminal that follows after Iteration should derive disjoint sets of terminals on the first position – for SA (diagram 2), Delim derives {;} and the nonterminal after StmtList is Eop which derives {eop}
Any function f involved in parsing is defined as
f:: SetOf TokenUnit -> SetOf TokenUnit
and will refer to the top element in the list of token units.
The parsing function for Sequence diagram
seqOf ::
(SetOf TokenUnit -> SetOf TokenUnit) ->
(SetOf TokenUnit -> SetOf TokenUnit) ->
SetOf TokenUnit -> SetOf TokenUnit
-- seqOf fX fY processes ->X -> Y->
seqOf fX fY = fY.fX -- composition
The parsing function for Alternation diagram
altOf ::
(SetOf TokenUnit -> SetOf TokenUnit)->
SetOf TokenUnit ->
(SetOf TokenUnit -> SetOf TokenUnit)->
SetOf TokenUnit ->
SetOf TokenUnit -> SetOf TokenUnit
-- altOf fX fY processes X or Y
altOf _ _ _ _ [] =
error ("Input: empty/ Alternative ")
altOf fX fXTUs fY fYTUs ts@(t:ts')
| fst t `elem` map fst fXTUs = fX ts
| fst t `elem` map fst fYTUs = fY ts
| otherwise = error("Input: "++
show t++"/ Expected: "++show(head fXTUs)