CPSC 503
Assignment #3
Assigned: Feb 20
Due Date: Mar 1 (11pm)
The exercises in this assignment total 160 points.
Please:
- send me an email with your answers in txt format (by Mar 1 (11pm))
- leave a copy of your answers in my mailbox (by noon Mar 2)
- work on this alone
- no plagiarism!
- do not wait until the last minute to start working on the assignment.
- if you have any clarification questions, post them on the course mailing
list
Note: All files mentioned are available on the course web page
(1) Part of speech tagging (30 pts):
Assume that you have found a file named "kingarthur-brill-tagged.txt" which you know has been tagged using the Brill tagger. The Brill tagger uses the Penn Treebank tagset (see the textbook, p. 297; also check the book errata!).
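For reference, Brill-tagger output consists of word/TAG pairs. A made-up line in that format (an illustration only, not taken from the actual file) looks like:

    Arthur/NNP is/VBZ the/DT king/NN of/IN the/DT Britons/NNPS ./.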
a. (5 pts)
Identify all mistakes made by the tagger. What is its accuracy on kingarthur.txt?
(Suggestion: remove the tags to make the text easier to read)
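One quick way to strip the tags (a sketch, assuming the word/TAG format illustrated above; verify the delimiter actually used in the file):

    perl -pe 's{(\S+)/\S+}{$1}g' kingarthur-brill-tagged.txt > kingarthur-plain.txt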
b. (5 pts)
Choose a set of three errors made by the tagger and, for each of them, explain how additional knowledge at the lexical, syntactic, and/or semantic level could have avoided the mistake.
c. (10 pts)
Assume that you have found another file named "AP-news-brill-tagged.txt" which again has been tagged using the Brill tagger. Identify all mistakes made by the tagger. What is its accuracy on this file? Compare this accuracy with the accuracy on the kingarthur-brill-tagged.txt file and explain any differences.
d. (10 pts) Look at the second sentence of the kingarthur-brill-tagged.txt file. Identify all words that can potentially have more than one part of speech tag associated with them (in general, not in this sentence in particular). The LEXICON.BROWN file can help you here. Suppose that you have a random part of speech tagger which picks one of the possible categories for each word at random (i.e., uniform distribution), regardless of context and/or lexical probabilities. What would be the accuracy of that random tagger, on average, on the sentence?
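One way to set up the "on average" computation (a sketch, assuming exactly one tag per word is correct in context): if word w_i in the n-word sentence has k_i possible tags, the random tagger picks the correct tag for w_i with probability 1/k_i, so its expected accuracy on the sentence is

    E[accuracy] = (1/n) * sum_{i=1}^{n} 1/k_i

Words with only one possible tag contribute 1/k_i = 1.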
(2) N-grams (applied to part of speech tag sequences):
These questions will use two data files: atis3.pos (the ATIS part of speech tagged corpus) and wsj3_00.pos (Section 0 from the Penn Treebank Wall Street Journal part of speech tagged corpus). The file wsj3_00.pos will be our source of training data and the file atis3.pos will be our source of test data.
a. (45 pts) The supplied Perl program clean-tagged.pl creates the files wsj3_00.pos.tags and wsj3_00.pos.emit from wsj3_00.pos. It also creates the files atis3.pos.tags and atis3.pos.emit from atis3.pos.
The supplied Perl programs skip.pl, paste.pl, clean-ngram.pl, ngramCounts.pl and
ngramLogProb.pl produce a bigram model of the part of speech tag sequences from the file
wsj3_00.pos.tags (this is our training data). The output of ngramLogProb.pl will be a bigram model of tag sequences stored as log probabilities: log2 P(t_i | t_{i-1}).
The file Qa.pl specifies how to run these programs to create the bigram model.
For this question you do not need to write any code. Your task is to understand the code in these files (what they do and why). Please add to these files as many comments as you feel are necessary (at least five per file, on average) so that someone else with little Perl experience can look at them and quickly understand them. As you know, comments should focus on important/cryptic aspects of the code (e.g., what a key data structure is used for, what a key loop does). Please do not change the code in any way. Submit the files skip.pl, paste.pl, clean-ngram.pl, ngramCounts.pl and ngramLogProb.pl with your comments. Also write a brief explanation of the structure of the files they produce.
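If it helps to orient yourself before commenting, here is a rough conceptual sketch of what the counting and log-probability steps compute: the maximum-likelihood bigram estimate log2 P(t_i | t_{i-1}) = log2( C(t_{i-1} t_i) / C(t_{i-1}) ). This is NOT the supplied code; the one-sentence-per-line input format and the <s> start marker are assumptions, so check the real scripts for their actual conventions.

    #!/usr/bin/perl
    # Conceptual sketch only -- NOT the supplied ngramCounts.pl/ngramLogProb.pl.
    # Assumes each line of the .tags input is one sentence of whitespace-
    # separated tags; the real pipeline's sentence-boundary handling may differ.
    use strict;
    use warnings;

    my (%unigram, %bigram);
    while (my $line = <>) {
        my @tags = split ' ', $line;
        next unless @tags;
        unshift @tags, '<s>';                 # assumed sentence-start marker
        for my $i (1 .. $#tags) {
            $unigram{ $tags[$i-1] }++;        # count of t_{i-1} as a history
            $bigram{ "$tags[$i-1] $tags[$i]" }++;
        }
    }

    # emit log2 P(t_i | t_{i-1}) = log2( C(t_{i-1} t_i) / C(t_{i-1}) )
    for my $pair (sort keys %bigram) {
        my ($prev) = split ' ', $pair;
        my $logp = log( $bigram{$pair} / $unigram{$prev} ) / log(2);
        print "$pair $logp\n";
    }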
b. (50 pts) Let T = s_0, ..., s_m represent the test data with sentences s_0 through s_m. Then

    P(T) = prod_{i=0}^{m} P(s_i) = 2^( sum_{i=0}^{m} log2 P(s_i) )

and therefore

    log2 P(T) = sum_{i=0}^{m} log2 P(s_i)
where log2 P(s_i) is the log probability assigned by the bigram model to the sentence s_i (note that in this homework, each s_i is a sequence of part of speech tags). These log probabilities are provided by ngramLogProb.pl (see Question 2a). Let W_T be the length of the text T measured in part of speech tags. The cross entropy for T is:
    H(T) = -(1/W_T) log2 P(T)
The cross entropy corresponds to the average number of bits needed to encode each of the W_T words in the test data. The perplexity of the test data T is defined as:
    PP(T) = 2^H(T)
Write a Perl program to compute the cross entropy and perplexity for a given input file. The input to the program is a bigram model over part of speech tag sequences and an input file with sentences that are part of speech tag sequences.
Using this program, print out the cross entropy and perplexity for the training data wsj3_00.pos.tags and the test data atis3.pos.tags.
On the test data, when a bigram is unseen, the probability for that bigram is zero. However, since we are using log probabilities, we cannot use a probability of zero (as log2(0) is not defined). Instead, use the value log2 P(t_i | t_{i-1}) = -99999 when a bigram (t_{i-1}, t_i) is unseen.
Remember that cross entropy and perplexity are both positive real numbers, and the lower the values, the better the model fits the test data.
Note: comment your code
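For orientation, a minimal sketch of the required computation follows. This is a sketch only: the "PREV TAG LOGPROB" model-file layout, the <s> start marker, and one tag sequence per input line are assumptions; match them to the actual output of your Question 2a pipeline.

    #!/usr/bin/perl
    # Sketch for Question 2b: cross entropy and perplexity of a tag file
    # under a bigram model. Assumed usage: perl q2b.pl MODEL TAGFILE
    use strict;
    use warnings;

    my ($model_file, $test_file) = @ARGV;

    # read the bigram model: "$prev $tag" => log2 P(tag | prev)
    my %logp;
    open my $mf, '<', $model_file or die "cannot open $model_file: $!";
    while (<$mf>) {
        my ($prev, $tag, $lp) = split ' ';
        $logp{"$prev $tag"} = $lp;
    }
    close $mf;

    my ($length, $logprob) = (0, 0);
    open my $tf, '<', $test_file or die "cannot open $test_file: $!";
    while (<$tf>) {
        my @tags = split ' ';
        next unless @tags;
        unshift @tags, '<s>';              # assumed sentence-start marker
        for my $i (1 .. $#tags) {
            my $key = "$tags[$i-1] $tags[$i]";
            # unseen bigram: log probability -99999, as specified above
            $logprob += exists $logp{$key} ? $logp{$key} : -99999;
            $length++;
        }
    }
    close $tf;

    my $h  = -$logprob / $length;          # cross entropy H(T) = -(1/W_T) log2 P(T)
    my $pp = 2 ** $h;                      # perplexity PP(T) = 2^H(T)
    print "file length=$length\n";
    print "file log prob=$logprob\n";
    print "file cross entropy=$h\n";
    print "file perplexity=$pp\n";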
c. (35 pts) Implement add-one smoothing to provide counts for every possible bigram (t_{i-1}, t_i). Using ngramLogProb.pl and your Perl program from Question 2b, recompute the cross entropy and perplexity for the training data wsj3_00.pos.tags and the test data atis3.pos.tags. Provide the output of the program. Do not smooth the unigram model P(t_i).
Note: comment your code
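Recall that add-one (Laplace) smoothing replaces the maximum-likelihood estimate with

    P(t_i | t_{i-1}) = ( C(t_{i-1} t_i) + 1 ) / ( C(t_{i-1}) + V )

where V is the size of the tag vocabulary. A hedged fragment, reusing the %unigram and %bigram hashes from the counting sketch in Question 2a (how the supplied scripts define the tag vocabulary is an assumption to verify against ngramCounts.pl):

    # add-one smoothing sketch (not the supplied code)
    my $V = grep { $_ ne '<s>' } keys %unigram;    # assumed tag vocabulary size
    for my $prev (sort keys %unigram) {
        for my $tag (sort grep { $_ ne '<s>' } keys %unigram) {
            my $c = $bigram{"$prev $tag"} // 0;    # 0 for unseen bigrams
            my $logp = log( ($c + 1) / ($unigram{$prev} + $V) ) / log(2);
            print "$prev $tag $logp\n";
        }
    }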
Output you should get for questions 2b-c:
no smoothing: training data
file length=47427
file log prob=-156555.892928136
file cross entropy=3.30098663057194
file perplexity=9.85589325276375
no smoothing: test data
file length=5507
file log prob=-8619453.07170324
file cross entropy=1565.18123691724
file perplexity=inf
add-one smoothing: training data
file length=47427
file log prob=-157925.748497338
file cross entropy=3.32987008449487
file perplexity=10.0552014788187
add-one smoothing: test data
file length=5507
file log prob=-21935.6382365245
file cross entropy=3.98322829789805
file perplexity=15.8150728671005