UNIT -IV FORMAL LANGUAGES
Languages and grammar, Phrase Structure grammar, four classes of grammars, pumping lemma for regular languages, context free languages
4.1 INTRODUCTION
We study computation to learn about the fundamental principles that stands as a basis for practical applications of computing. The three important structures used in models of computation are grammars, finite state machines and turing machines. Grammars are used to generate the elements also called as words of a language. Formal languages, which are generated by grammars, provide models for natural languages like English and for programming languages. Grammars are very important in the theory and construction of compilers. We introduce various types of grammars and study certain relations between them.
4.2 LEARNING OBJECTIVE
- To be capable of writing grammars for a given language and also recognizing the language generated by a given grammar
- To be capable of checking whether a particular word is derivable by the given grammar
- To be capable of recognising the type of given language, in particular to check whether a language is not regular or not context- free.
4.3 BASIC CONCEPTS OF AUTOMATA THEORY
We shall discuss the basic definition of alphabet, string and languages with few illustrations.
ALPHABET
An alphabet is a finite, nonempty set of symbols (or characters) denoted by (to be read as sigma) or V.
Examples:
1. is an alphabet (of all lower case English letters) with symbols a, b,
c, …, z.
2. denotes the binary alphabet with symbols 0 and 1
3. is also an alphabet.
STRINGS:
A finite sequence of symbols (may or may not be repeated) chosen from some alphabet is defined to be a string (or word).
Examples.
1. Let we can form the strings 010, 0110, 111, 01 etc.
2. . The strings over are bat, cat, bet, eat, bee, beat and so on. But “car” is not a string over since the symbol r does not belong to .
EMPTY STRING:
A string having no symbol is called an empty string usually denoted by (epsilon) or (phi) or (lamda).
Notation :
The lower case letters near the end of English alphabet such as u,v,w, x, y, z are always used to represent a string.
LENGTH OF A STRING
Let be an alphabet. Let u be a string over , then the length of the string u is the number of symbols in the string (not necessarily distinct) and is denoted by .
For instance, if then u = 0101101 is a string over
u = 0 1 0 1 1 0 1
here we have seven symbols constituting the word u with repetitions then the length of u, .
If , then the strings of length 2 are given as aa, bb, cc, ab, ba, bc, cb, ca, and ac.
Clearly length of the empty string is 0. (no symbol to count)
NOTATION:
- the set of all strings over the alphabet
- set of all strings over excluding the empty string
be the strings of length n formed by the symbols in and
It is obvious that , a collection of all strings of length 0, length 1, length 2 etc., and .
Example:
Let
Let = the set of all two length strings formed by the symbols in
=
4.3.1 CONCATENATION OF STRINGS
The main operation on strings is concatenation. Let be an alphabet. Let x, y be two strings of then the concatenation of x and y gives a new string xy in by putting them together end to end.
Clearly
It is obvious that .
Example:
Let
Let ,
Then z = xy = 00101110101 is a string belonging to and
substring, Suffix and prEfix of a string:
If w = xuy, then u is called a sub string of w. String u is a suffix of v if v = xu for some string x . String u is a prefix of v if v = uy for some string y.
Example:
If w = aaabbbccd, be a string over then
u = abb is a substring of w whereas acd is not a substring of w.
If u = abb, then bb is a suffix of u and ab is a prefix of u.
REVERSAL OF A STRING
If is a string of an alphabet then the reversal of x is
. Clearly . If , then x is called a palindrome.
Example:
- x = abcda then xR = adcba, . But , x is not a palindrome.
- x = aba then xR = aba so you can check that x is a palindrome of length 3
- Concatenation: x = abd, y = hlm, then xy = abdhlm and yx = hlmabd
Reversal: x = abd, xR = dba
Exponentiation: Let x = abd then x0 = , x2 = xx = abdabd, x3 = xxx = abdabdabd
4.3.2 LANGUAGE:
A language L over an alphabet is a subset of . A language may be empty.
Here you should note a point ,a language containing an empty string is not an empty language.
Examples.
Let
1.
2.
3. = {000, 001, 010, 011, 100, 101, 110, 111}.
4. L = { the set of all strings of length 4 that begin with a and end with b over {a,b,c}}
= { aaab,abb,abab,abbb,aacb,acab,accb,abcb,acbb}
operation on languages:
Since languages are subsets of , the usual set operations like union, intersection, difference, complement can be performed .
Consider two languages L1 and L2 over the alphabet . then,
1. Union :
2. Intersection :
3. Complementation : , for a language L
4. Concatenation: Another interesting operation is concatenation.
L1 L2 = , a language consisting the set of all concatenation of strings in L1 with strings in L2.
Note:
1. Concatenation operation is not commutative.
Let , L1={ a,ab}, L2={ b,abb}. L1 L2= { ab,aabb,abb,ababb} and
L2 L1= { ba,bab,abba,abbab}. Hence L1 L2 L2 L1.
2. However concatenation operation is associative.
Consider languages L1 , L2 and L3 over the alphabet
L1 ( L2 L3 ) = (L1 L2 ) L3 which implies that products can be written without using parenthesis.
Powers of languages. Given a language L, let L0 = { } , Ln = L Ln-1 for n > 0
Example:
If and L = { a,bb}, then L0 = {) L1=L, L2= LL = { aa,abb,bba,bbbb}
5. Kleen closure (or Kleen star) of a language L over is defined as
and the positive closure of L also called as Kleen positive is the language denoted by .
Note: . However need not be equal to . Consider . Let L = , then
One can show that
PROPERTIES OF CLOSURE OPERATION:
Let be two languages over , then
1.
2.
HAVE YOU UNDERSTOOD THE CONCEPTS ?
Answer the following:
- Is an alphabet? Ans: yes
- What is the length of the string abcdefda Ans: 8
- If x = abcba, y = bcaba
- What is the concatenation y with x Ans: bcabaabcba
- Is x = y No
- Is x a reversible string Yes
- Is ? Is y reversible? =5,y not reversible
4. What is the language that generates the strings
Ans:L={an/ n is a prime number }
4.4 PHRASE STRUCTURE GRAMMARS ( Type- 0 grammar)
In section we learn the notions and definitions of a phrase-structure grammar and phrase structure language with some examples. From this section we will denote the alphabet by V.
4.4.1 PHRASE-STRUCTURE GRAMMAR
A phrase-structure grammar (PSG) denoted by G consists of four components,
G = (V, T, P, S), where: V is a finite set of symbols called non-terminals,
T is a finite set of terminal symbols, P is a finite set of productions with rules of the form, u and v are strings over the alphabet with the restriction that u is not the empty string. S a specific nonterminal is called the start symbol. Such a grammar is always referred as an arbitrary grammar or type-0 grammar.
A string made up of terminals and /or nonterminals is called a sentential form. By derivation we mean that if x and y are sentential forms and is a production then the replacement of u by v in xuy and it is denoted as
Notation: derives in one step
derives in one or more steps
derives in zero or more steps.
Examples:
1.Consider the grammar G = ({S}, {a,b,c}, P, S) where P is given by
. These productions are also written in the short form as
2. Consider the grammar G = ({S,A,B}, {a,b}, P, S) where P is given by
. These productions are also written in the short form as
consider . This statement is written as
consider . This shows that the string aab is obtained using two different derivations.
Note:
When a grammar contains more than one terminal, it may be possible to find different derivations of the same string.
4.4.2 PHRASE-STRUCTURE LANGUAGE:
A language generated by a phrase-structure grammar is said to be a phrase-structure language, i.e., if G = (V, T, P, S) is a PSG, then
is the phrase-structure language generated by G.
Example 1:
Consider a PSG, G = (V, T, P, S) with V = {S, A}, T = {a, b}
= ,
consists of four production rules, S the start symbol, then what is L(G)?
Solution:
The grammar works as follows.
Starting with S, applying rule, (n-1) times we get an-1S,
Applying the rule we get anA,
Applying the rule , m-1 times we get the string anbm-1A,
Finally applying the rule we get anbm
It is easy to follow that a collection of words of the anbm with any number of a’s followed by any number of b’s forms the language generated by the grammar G.
Example 2:
Consider the grammar G = (V, T, P, S) where
S – the start symbol
It can be shown that
Above two examples implies the following:
Note:
A language can have more than one grammar.
Example 3:
Find the language generated by the grammar:
Solution:
starting with S production
Note:.
Given a grammar G, framing L (G) is discussed..
Now, the natural question that arises is given a (PSL) language L (G) how to frame the corresponding grammar G:
Example 1:
If is a language over . To construct a grammar
G = (V,T, P, S) such that .
Solution:
Define where P consists of . Then , which implies . i.e., . Therefore . Let . Assume that in n steps. If n = 1, then . If , then the production is to be used for the first (n-1) steps, while the production is to be used in the (final) nth step. Now it follows that. Thus, then for some. Hence . Hence
.
In particular the derivation step for aabb is given by
Example 2:
Construct a suitable grammar for .
Solution:
Let generates L.
Claim
In all words number of b’s are double of number of a’s.
4.4.3 COMBINATION OF GRAMMARS:
To write a grammar for the language L = {, a,b,aa,bb,aaa,bbb,… an, bn, ….}
It is clear that L is union of the languages M ={, a,aa,aaa, … an, ….}and
N={,b,bb,bbb,… , bn, ….}
Grammar for M is
Grammar for N is
The grammar for L=MN is ,,.
UNION RULE:
In general if M and N are two languages whose grammars have disjoint sets of nonterminals by renaming them if necessary. Let A and B be start symbols of M and N , then the language for MN starts with the two productions .
PRODUCT RULE:
The language MN starts with the production
To write a grammar for the language L = { ambn / m, n are non-negative integers}
It is clear that L is product of the two languages M ={, a,aa,aaa, … an, ….}and
N={,b,bb,bbb, bn, ….}.
Grammar for M is
Grammar for N is
The grammar for L = MN is ,,.
Examples:
- Let and the language L be the set all palindromes over {a, b}Find a grammar G such that L = L(G).
Solution:
Define where P consists of the productions given by
. Obviously a palindrome of even length can be derived using and . If w is a palindrome of odd length with centre-marker a, then the production can be used to derive w. When the centre-marker is b then we have to use as the final production. Thus G is the required grammar.
- Find the grammar generating
Solution:
Define where P consists of .
i.e., . Also. So . Hence.
Let. Let in k steps. In the first step either the production or the production should be used. As, in the last step, we have to use the production. It follows that. We observe that, cannot be used in all the steps, since S cannot be eliminated in the resulting string. S can be eliminated only by using). If the production should be used in some step, say (m+1)th step for some , then the non-terminal S will not appear in subsequent steps and only the production should be used in m+2, m+3, …, k-1 steps. Note that has been in the first m steps. Thus . Hence we have .
- Construct a grammar G for the language
Solution:
Define where P consists of . We first show that .
. Hence .
(by applying once)
(by applying (n-2) times)
(by )
(by applying (m-1) times)
(by applying finally).
Hence for .
Let . Let in k steps. As , it follows that . The production must be used in some step, otherwise S, a variable, will remain till the end. If the production is used in a step, S will be eliminated at that step and only the productions and can be used in subsequent steps. As the production eliminates the variable A, it should be used only at the last step. Thus in deriving the word w,
the production is used in nth step for some,
the production is used in first n-1 steps
the production is used in nth to (k-1)th steps
the production is used in the last step.
Thus
in the (n-1)th step.
in the nth step using S bA
in the (k-1)th step.
in the kth step.
So for some .
Thus
- Construct a grammar for the language .
Solution:
Define where P consists of . The construction will be clear if we realize that . and can be used to generate . For getting in we can use the productions. Now
(by applying (i-j-1) times)
(by )
(by applying (j-1) times)
(by applying finally)
Hence when. The reverse inclusion can be proved as in earlier examples.
- Obtain a grammar for the language.
Solution:
Before constructing the grammar we note that where , , . is the language given in the previous problem. Now let us define grammar G accepting the given language L
where P consists of
………… (1)
………… (2)
………… (3)
…………. (4)
By the previous example , if and only if. The productions given in (3) together with and generates a word in. Hence if and only if. Obviously if and only if. By (1) , we see that . Hence .
- Find the language generated by the grammar where , ,
Solution:
We show that .
using once, (n-1) times and finally once. Hence .Also by applying m times, where . So .
Hence .
The reverse inclusion can be proved
7. Construct a grammar G such that .
Solution:
Define where P consists of .
To prove that G accepts the given language, we prove the following by induction on .
- if and only if w consists of an equal number of a’s and b’s.
- if and only if w consists of one more a than the number of b’s.
- if and only if w consists of one more b than the number of a’s.
(i), (ii), (iii) are true when . For and a is the only string of length one derivable from A. Similarly b is the only string of length one which can be derived from B. Also no string of length one is derivable from S. Thus there is a basis for induction.
Assume (i), (ii), (iii) to be true for all strings of length k-1. Let .
We prove ‘only if’ part of (i). Let . Then the first production has to be either or . If the first production is , then . Hence , and . By induction hypothesis, (iii) is true for. This means has one more b than the number of a’s. Hence has equal number of a’s and b’s. The proof is similar if the first production is .
We list some languages along with a grammar for each language.
Language / Grammar/ ( {S, A}, {a, b},P,S})
,
/ (V, T, P, S) where
set all palindromes over the alphabet{a, b} / where P consists of the productions given by
{a,ab,aab,aaab} / ,
{ac,abc,ab2c,…abnc….} /
Have you understood the concepts?
ANSWER THE FOLLOWING:
- The PSG or simply a grammar . What is language generated by G.
- If then L is a language over the alphabet
- If then is a language over the
alphabet
- If , then what is .
5. If
6. generates the language . True or
False?
Ans : 1. 2. (c) 3. (a) 4. empty 5. (a) 6. true
4.5 Classification of grammars:
In this section we focus on the four main classifications of grammars namely
type 0, type 1, type 2 and type 3 grammars.
TYPE 0 GRAMMAR
A general phrase structure grammar is called a type 0 grammar or an arbitrary grammar.
TYPE 1 GRAMMAR (context-sensitive grammar)
A PSG G = (V,T,P,S) is called a context-sensitive grammar (CSG); if each is such that u = u1Au2, v = u1xu2, where , , . A CSG is also called a type-1 grammar.
A is replaced by x in the context of u1 and u2 hence the name context-sensitive.
u1Au2 → u1xu2
The phrase-structure language, generated by a context- sensitive grammar is a context sensitive language (CSL).
Suppose for every rule we have then the PSG is said to be length-increasing. Now it is easy to understand a grammar is CSG if it is length increasing. In other words a grammar is CSG if the right hand side of any production is atleast as long as its left hand side.
Example:
Consider the grammar G = (V, T, P, S) where V = {S, B}, T = {a, b, c}
P consists of the following productions (rules)
S → aSBC, S → abc
CB → BC, bB → bb
Then is a CSL. We also observe that the rules in P obeys
Why is this grammar context-sensitive?
Observe the rule CB → BC, CB is replaced or rewritten as BC only if we have the context CB in the left side of the rule. Like wise the rule bB → bb, B is replaced by b
In the context of ab as its left member, other wise the rule cannot be applied.
type 2 grammar (CONTEXT-FREE GRAMMAR):
A context-free grammar consists of four components, G = (V, T, P, S) where,
V is a finite non-empty set of symbols, called the non-terminals
T is the terminal alphabet, , , the start symbol
P a finite non-empty set of rules of the form
Type 2 language(CONTEXT-FREE LANGUAGE)
The language L(G) generated by a context-free grammar G is called a context-free language(CFL).
Note:
It is very clear from the definitions that every CFL is CSL.
Examples:
1. Let G = ( { S,A,B,C } , {0.1}, P,S ) where P: S → A | B | C, A → 0C |
B → 1B | 1 , C → BA.
Find L (G) . Is G a CFG ?
To find L(G)
S → C
BA
B0C
(B0)n C
(B0)n BA
(B0)n B, for some n > 0
S → B
1B
11B
1n B
1n+1
S → A
S → A
0C
0BA
0B0C
0 (B0)n C
0(B0)n BA = (0B)n+1
Hence we get
L(G) = { w in {0,1}* | each 0 in w is followed immediately by 1 }.
As the left hand side of each production is a single variable, G is context-free grammar
2. Let G = (V, T, P, S)
V = {S}, T = {a, b}
P = {S → SaSaSaS, S → SaSaSbS, S → SbSaSaS, S → }
Then
3. Let G = (V, T, P, S), V = {S}, T = {a, b}
, then
4. Show that G = (V,T,P,S), where V = { E,T,F}, T { a ,+, *, ( , ) },
P= { E E+T| T, T T * F | F, F (E) | a } is CFG.
Proof:
As the left hand side of each production is a single variable, G is context-free grammar.
TYPE 3 grammar (REGULAR GRAMMAR)
A phrase-structure grammar G = (V, T, P, S) is called a regular-grammar if each rule is such that .
Regular grammars are called as type 3 grammars or right-linear grammar.
Type 3 language (REGULAR language)
A language generated by a regular grammar G is called a regular language or a regular set (REG or RL).
Note:
Regular languages are CFL but the converse need not be true.
Examples:
1. Any finite language is regular. This implies that the class of finite languages is a subset of Regular languages.
Proof:
Let L = { a1, a2, a3,…,an }be a language with n elements.
Consider the grammar G = (V, T, P, S), Where V = S, T ={ a1, a2, a3,…,an }and
P = { S a1, S a2, … ,S an }.
Clearly S is a regular language.
2. G = (V, T, P, S), V = {S, A}, T = {a, b}
. Then which is a regular language.
3. G = ( { S, A, B, C, D, E }, {a}, P, S) where P = { S ACaB, CBDB, aDDa, AEEa, Ca aaC, CBE, ADAC, AE.
As CBE is a production in G , G is not CSG and hence not CFG or Regular Grammar.
Hence G is of type 0 grammar.
4. The language L = { an bn | n 0 } is not a regular language . The grammar which is not regular for this language is S | aSb.
This language is CFL but not regular.
Hence the set of regular languages is a proper subset of all CFL.
5. The language L = { an bn cn | n 1 }is CSL but not CFL
A Grammar generating L is given by:
S aSBC|aBC, CB BC, aB ab, bB bb, bC bc, cC cc.
Observe that the left hand side of the productions are not all single nonterminals.
Hence the set of CF languages is a proper subset of all CSL.
4.5.1 Chomsky heirarchy:
Every regular grammar is a context-free grammar, every context-free grammar is a context-sensitive grammar and every context-sensitive grammar is a phrase-structure grammar.
Hence we have the following hierarchy given by Noam Chomsky.
REG CF CS PS where REG stands for the family of regular languages, CFL stands for the family of context-free languages, CS stands for the family of context-sensitive languages and PS stands for the family of phrase-structure languages.
context-free but not regular