Phrase-Structure Grammars and Languages

Formal Grammars and Languages

N.Chomsky, 1950’s

Theory of Generative Grammars (Phrase-structure Grammars)

 The origin was for the mathematical (qualitative) study of linguistics.

 It turned out in 1960’s that the theory was useful for the study of programming languages, too. Now it is one of basic theories for theoretical computer science.

Why Phrase-structure Grammar?

There are dependencies among the phrases constituting sentences.

S

NP Aux VP

Det N  V NP

the boy love PN

Mary

Generation of sentences according to phrase-structure rules

S  NP Aux VP (SNP Aux VP)

 Det N Aux VP (NPDet N)

 the NP Aux VP (Detthe)

 the boy Aux VP (NPboy)

 the boy VP (Aux)

 the boy V NP (VPV NP)

 the boy love NP (Vlove)

 the boy love PN (NPPN)

 the boy love Mary (PNMary)

generation by rewriting phrase-structure rules

’Grammar’ can be considered to be a set of phrase-structure rules:

Relations (dependencies) among grammatical categories:

SNP Aux VP, NPDet N | Det AP N | PN,

VPV NP | V | V PP APAdj AP | Adj,

PPPrep NP, ……

Vocabulary:

Nboy | girl | dog | …, Vlove | read | study | …,

PN Mary | Jack | …, Auxbe | will | may | …,

Detthe | a | an | …, Adjpretty | small | red | …,

Prepon | in | for | …, ……

Phrase-Structure Grammar

A rewriting system, a modification of the Thue system (1906)

G=(N,T,P,S), where

1) N is a finite alphabet (the set of nonterminals),

2) T is a finite alphabet (the set of terminals),

3) P is a finite subset of V*NV*V*, where V=NT. Each element, say (u,v), of P is called a production or rewriting rule, and written uv, and

4) S is a specified symbol of N (the sentence symbol or start symbol).

Rewriting in G (a binary relation, , on V*)

 G ’ or simply   ’ ( directly derives ’) if =u and ’=v for some uv in P.

 is said to derive  if *, where * is the reflexive and transitive closure of .

A derivation of  from  is *, i.e., either = or =01n= for some 1,…,n.

Recall that + and n are the transitive closure and the n-th power of , respectively.

The language G generates

L(G)={xT* | S *x}

G and G’ are said to be equivalent if L(G)=L(G’).

X1|…|n is an abbreviation of X1, …, Xn.
Four Types of Phrase-Structure Grammars

(according to the forms of productions)

G=(N,T,P,S), V=NT

type 0 (unrestricted): No restriction

type 1 (context-sensitive, CS): Each production in P is of the form X, where ,V*, XN, and V+

type 2 (context-free, CF): Each production in P is of the form X, where XN and V*

type 3 (right-linear or regular): Each production in P is of the form XaY or X, where X,YN and aT

Let ℒi denote the class of type i languages.

Examples

  1. Let G1=({A,B},{0,1},P1,A), where P1={A0|1B, B0B|1B|0}.

G1 generates {0}{1}{0,1}*{0} or 0+1(0+1)*0, the set of binary representations of even nonnegative integers. G1 is of type 3 or right-linear. Thus L(G1) is in ℒ3. An example of derivation in G1:

A1B10B101B1010, and therefore we have A*1010.

  1. G2=({S},{a,b},{SaSb|},S) generates a nonregular language AnBn. Since G2 is a type 2 (or context-free) grammar (CFG), AnBn is a type 2 (or context-free) language (CFL). As you have learned already, this is a nonregular language. In fact, AnBn is in ℒ2-ℒ3.

SaSbaaSbb…anSbnanbn for any n1. For n=0, we have

S=a0b0. Thus S*anbn for any n0.

  1. Let N={S,A,B,C}, T={a,b,c}, and P3={S, SA, AaABC, AaBC, CBBC, aBab, bBbb, bCbc, cCcc}. G3=(N,T,P3,S) is a type 0 grammar generating a non-context-free language (which will be proved later) AnBnCn={anbncn | n0}.

An example derivation of a2b2c2 from the start symbol:

SAaABCaaBCBCaabCBCaabBCCaabbCCaabbcC

aabbcc.

It is known that AnBnCn is in ℒ1-ℒ2.

Type 3 Languages = Regular Sets

THEOREM The class of regular sets coincides with the class of type 3 languages, that is, ℒ3=ℛ.

Proof. ℒ3ℛ: For a given type 3 grammar G=(N,T,P,S), let M=(N{f},T,,S,{f}) be the -NFA defined by

(X,a) contains Y iff XaY is in P, and

(X,) contains f iff X is in P.

Then L(M)=L(G).

ℒ3ℛ: For a DFA M=(Q,,,q0,F), define the type 3 grammar G=(Q,,P,q0) as follows:

qap if (q,a)=p, and

q for each q in F.

(A similar construction is possible for NFAs and -NFAs.)

Then L(G)=L(M).

Examples

1. Converting DFAs, NFAs, or -NFAs to type 3 grammars:

2. Converting type 3 grammars to -NFAs:

Exercises

  1. What is L(G) for the following G’s ?:
  1. Give type i grammars generating the following languages:

3. Show type 3 grammars and finite automata for the following regular sets:

1

#006