Log-Linear Models in NLP Applications

Daniel Marcu

ACKNOWLEDGMENTS AND ATTRIBUTIONS: These lecture notes reproduce significant segments of the Tutorial Notes of Manning and Klein (2003) and Lecture Notes of Collins (2003).

1. Example problem: POS tagging

The/DT Canadian/JJ government/NN plans/VBZ to/TO auction/VB on/IN Tuesday/NNP 750/CD million/CD Canadian/JJ dollars/NNS of/IN bonds/NNS ./.

Assume that we want to build a system that automatically assigns POS tags to input sentences. Assume that we traverse the text left to right and at any given word, we make a decision as to what tag to assign. What we would like to do is to make informed decisions, depending on the context/history memorized during the tagging process. When tagging, we would like to consider answering questions like:

  • What is the word that we are tagging right now? (If it is “the”, we suspect the probability of labeling it as DT is high).
  • What was the tag we assigned to the previous word in the sequence? (If it was DT, we are probably more likely to assign to the current word a JJ or NN than a VBD tag).

We would like to have a mechanism for enumerating all the “questions/features” that we assume to be useful for tagging, and a mechanism for automatically learning the weights that we should assign to these features when multiple features fire. We would like to be able to use an arbitrarily large number of features and to impose few or no constraints on the types of questions that the features encode.

It is usually infeasible to memorize all the steps we take during the tagging process, so we usually work with a finite history/context:

Example

Input domain: X = < t_{i-3}, t_{i-2}, t_{i-1}, w_{[1:n]}, i >  // a history consisting of the previous three tags, all the words in the sequence to be tagged, and the position i of the word to be tagged.

Output labels: Y = {DT, NN, JJ, IN, NNS, …}
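To make the notion of a history concrete, the sketch below shows one way such a history could be represented in code; the History tuple, its field names, and the helper "questions" are purely illustrative assumptions for this sketch, not taken from any particular tagging package.

```python
from collections import namedtuple

# A history x = <t_{i-3}, t_{i-2}, t_{i-1}, w_{[1:n]}, i>: the previous three tags,
# the full word sequence, and the index of the word currently being tagged.
History = namedtuple("History", ["tm3", "tm2", "tm1", "words", "i"])

words = "The Canadian government plans to auction on Tuesday".split()

# Tagging "plans" (position 3) after having assigned DT, JJ, NN to the first three words:
x = History(tm3="DT", tm2="JJ", tm1="NN", words=words, i=3)

# The "questions" above become boolean tests over <history, candidate tag> pairs:
def q_current_word_is_the(x, y):
    return x.words[x.i].lower() == "the" and y == "DT"

def q_previous_tag_is_dt(x, y):
    return x.tm1 == "DT" and y in ("JJ", "NN")
```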

2. Generic conditional models

Given an input domain X, a finite set of labels Y, and many examples <x, y>, learn a conditional probability P(y | x) for any x ∈ X and y ∈ Y.

Example training set

W./NNP Ed/NNP Tyler/NNP ,/, 37/CD years/NNS old/JJ ,/, was/VBD elected/VBN president/NN ./.

Stoltzman/NNP has/VBZ taken/VBN a/DT gentler/JJR ,/, more/RBR audience-friendly/JJ approach/NN ./.

Want to label any other sentence with POS tags

Mary has a little lamb .

3. Log-linear models

Given

  • An input domain X and a finite set of labels Y
  • A set of m feature functions φ_k : X × Y → ℝ (very often these are indicator/binary functions: φ_k : X × Y → {0, 1}).
  • The feature vectors Φ(x, y) ∈ ℝ^m induced by the feature functions φ_k for any x ∈ X and y ∈ Y

Learn a conditional probability P(y | x, W), where

  • W is a parameter vector of weights (W ∈ ℝ^m)
  • P(y | x, W) = e^(W ⋅ Φ(x, y)) / ∑_{y' ∈ Y} e^(W ⋅ Φ(x, y'))

log P(y | x, W) = W ⋅ Φ(x, y) − log ∑_{y' ∈ Y} e^(W ⋅ Φ(x, y'))   [the difference between a linear term and a normalization term]

Examples of feature functions for POS tagging:

φ_1(x, y) = 1 if the word being tagged is “the” and y = DT; 0 otherwise

φ_2(x, y) = 1 if the previous tag t_{i-1} is DT and y = NN; 0 otherwise

φ_3(x, y) = 1 if the two previous tags t_{i-2}, t_{i-1} are DT, JJ and y = NN; 0 otherwise

It is natural to come up with as many feature functions (questions about <history, output> pairs) as we can.
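As a concrete and deliberately tiny sketch of how these definitions map to code, the fragment below implements the three example features above as indicator functions, builds Φ(x, y), and computes P(y | x, W) with the formula at the start of this section. It reuses the illustrative History tuple from Section 1; the label set and all names here are assumptions made for the sketch.

```python
import math

# Label set Y (from Section 1) and the example indicator features phi_1 ... phi_3.
LABELS = ["DT", "NN", "JJ", "IN", "NNS"]

def phi_1(x, y):  # the word being tagged is "the" and y = DT
    return 1.0 if x.words[x.i].lower() == "the" and y == "DT" else 0.0

def phi_2(x, y):  # the previous tag is DT and y = NN
    return 1.0 if x.tm1 == "DT" and y == "NN" else 0.0

def phi_3(x, y):  # the two previous tags are DT, JJ and y = NN
    return 1.0 if (x.tm2, x.tm1) == ("DT", "JJ") and y == "NN" else 0.0

FEATURES = [phi_1, phi_2, phi_3]

def feature_vector(x, y):
    """Phi(x, y): the m-dimensional vector of feature values."""
    return [phi(x, y) for phi in FEATURES]

def conditional_probability(x, y, W):
    """P(y | x, W) = e^(W . Phi(x, y)) / sum over y' of e^(W . Phi(x, y'))."""
    def score(label):
        return sum(w * f for w, f in zip(W, feature_vector(x, label)))
    return math.exp(score(y)) / sum(math.exp(score(y2)) for y2 in LABELS)
```

Labels whose features fire for a given history receive exponentially more probability mass as the corresponding weights grow; labels for which no feature fires all share the same baseline score of zero.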

Learning in this framework then amounts to learning the weights W_ML that maximize the likelihood of the training corpus.

W_ML = argmax_{W ∈ ℝ^m} L(W) = argmax_{W ∈ ℝ^m} ∏_{i=1..n} P(y_i | x_i, W)

L(W) = ∑_{i=1..n} log P(y_i | x_i, W)

     = ∑_{i=1..n} W ⋅ Φ(x_i, y_i) − ∑_{i=1..n} log ∑_{y' ∈ Y} e^(W ⋅ Φ(x_i, y'))

Note: Finding the parameters that maximize the likelihood/probability of some training corpus is a “universal” machine learning trick.

Summary: we have cast the learning problem as an optimization problem.

Several solutions exist for solving this problem:

  • Gradient ascent
  • Conjugate gradient methods
  • Iterative scaling
  • Improved iterative scaling
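As an illustration of the first of these options, here is a minimal batch gradient-ascent sketch; it reuses feature_vector, conditional_probability, and LABELS from the earlier sketch in this section, and the learning rate and iteration count are arbitrary illustrative values. The gradient of L(W) has the usual "empirical feature counts minus expected feature counts" form.

```python
def log_likelihood_gradient(data, W):
    """dL/dW_k = sum_i phi_k(x_i, y_i) - sum_i sum_y P(y | x_i, W) * phi_k(x_i, y)."""
    grad = [0.0] * len(W)
    for x, y in data:                       # data is a list of (history, label) pairs
        for k, f in enumerate(feature_vector(x, y)):
            grad[k] += f                    # empirical counts
        for y2 in LABELS:
            p = conditional_probability(x, y2, W)
            for k, f in enumerate(feature_vector(x, y2)):
                grad[k] -= p * f            # expected counts under the model
    return grad

def train(data, m, learning_rate=0.1, iterations=100):
    """Plain batch gradient ascent on L(W); real packages use the better optimizers listed above."""
    W = [0.0] * m
    for _ in range(iterations):
        gradient = log_likelihood_gradient(data, W)
        W = [w + learning_rate * g for w, g in zip(W, gradient)]
    return W
```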

4. Maximum Entropy Models

An equivalent approach to learning conditional probability models is this:

  • There are lots of conditional distributions out there, most of them very spiked, overfit, etc. Let Q be the set of distributions that can be specified in log-linear form:

Q = { p : p(y | x_i) = e^(W ⋅ Φ(x_i, y)) / ∑_{y' ∈ Y} e^(W ⋅ Φ(x_i, y')) for some W ∈ ℝ^m }

  • We would like to learn a distribution that is as uniform as possible without violating any of the requirements imposed by the training data.

P = { p : ∑_{i=1..n} Φ(x_i, y_i) = ∑_{i=1..n} ∑_{y ∈ Y} p(y | x_i) Φ(x_i, y) }

(empirical count = expected count)

p is an n × |Y| vector defining p(y | x_i) for all i, y.

Note that a distribution that satisfies the above equality always exists:

p(y | x_i) = 1 if y = y_i; 0 otherwise

(i.e., the empirical distribution of the training data itself satisfies the constraints).

  • Because uniformity equates to high entropy, we can search for distributions that are both consistent with the requirements imposed by the data and have high entropy.
  • Entropy of a distribution p:

H(p) = −∑_x p_x log p_x

  • Entropy is uncertainty, but also non-commitment.
  • What do we want from a distribution p?
      • Minimize commitment = maximize entropy
      • Resemble some reference distribution (the data)
  • Solution: maximize entropy H, subject to constraints f.
  • Adding constraints (features):
      • Lowers maximum entropy
      • Raises maximum likelihood
      • Brings the distribution further from uniform
      • Brings the distribution closer to a target distribution

  • Let's say we have the following event space:

        NN     NNS    NNP    NNPS   VBZ    VBD

  • … and the following empirical data (counts):

        3      5      11     13     3      1

  • Maximize H:

        1/e    1/e    1/e    1/e    1/e    1/e

  • … but we wanted probabilities: E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1

        1/6    1/6    1/6    1/6    1/6    1/6

  • This is probably too uniform:

        NN     NNS    NNP    NNPS   VBZ    VBD
        1/6    1/6    1/6    1/6    1/6    1/6

  • … we notice that N* are more common than V* in the real data, so we introduce a feature fN = {NN, NNS, NNP, NNPS}, with E[fN] = 32/36:

        8/36   8/36   8/36   8/36   2/36   2/36

  • … and proper nouns are more frequent than common nouns, so we add fP = {NNP, NNPS}, with E[fP] = 24/36:

        4/36   4/36   12/36  12/36  2/36   2/36

  • … we could keep refining the model, for example by adding a feature to distinguish singular vs. plural nouns, or verb types.
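The arithmetic in this worked example is easy to check mechanically. The short sketch below (written only for this six-outcome example) verifies that the final distribution sums to one, satisfies both feature-expectation constraints, and reports its entropy.

```python
import math
from fractions import Fraction

labels = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
p = dict(zip(labels, [Fraction(4, 36), Fraction(4, 36), Fraction(12, 36),
                      Fraction(12, 36), Fraction(2, 36), Fraction(2, 36)]))

f_N = {"NN", "NNS", "NNP", "NNPS"}   # noun feature
f_P = {"NNP", "NNPS"}                # proper-noun feature

assert sum(p.values()) == 1                          # probabilities sum to one
assert sum(p[y] for y in f_N) == Fraction(32, 36)    # E[fN] constraint
assert sum(p[y] for y in f_P) == Fraction(24, 36)    # E[fP] constraint

entropy = -sum(float(q) * math.log(float(q)) for q in p.values())
print(f"H(p) = {entropy:.3f} nats")
```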

Fundamental theorem: It turns out that finding the maximum-likelihood solution to the optimization problem in Section 3 is the same as finding the maximum-entropy solution to the problem in Section 4.

  • The maximum entropy solution can be written in log-linear form.
  • Finding the maximum-likelihood solution also gives the maximum entropy solution.

5. Comparison of Naïve-Bayes and Log-linear models

  • Training objective: Naïve-Bayes is trained to maximize the joint likelihood of the data and classes; log-linear models are trained to maximize the conditional likelihood of the classes given the data.
  • Independence: Naïve-Bayes assumes features supply independent evidence; log-linear feature weights take feature dependence into account.
  • Estimation: Naïve-Bayes feature weights can be estimated independently; log-linear feature weights must be estimated jointly.
  • Feature form: Naïve-Bayes features must be of the conjunctive form Φ(x) ∧ (y = y_i); log-linear features need not be conjunctive (but they usually are).

6. NLP specific issues

Lots of features

  • Current software packages can handle millions of features.
  • Storing feature arrays can, though, be expensive memory-wise.

Lots of sparsity

  • Overfitting is likely to happen – need smoothing!
  • Many features seen in training will never occur at test time – use feature selection!
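One common way to smooth maxent models (a Gaussian prior on the weights, i.e., an L2 penalty; a standard technique rather than something specific to these notes) is to subtract a penalty term from the log-likelihood, which keeps the weights of rarely seen features from growing without bound. A minimal sketch, reusing log_likelihood_gradient from the gradient-ascent sketch in Section 3:

```python
def penalized_gradient(data, W, sigma2=1.0):
    """Gradient of L(W) - ||W||^2 / (2 * sigma2); sigma2 is a tunable smoothing constant."""
    grad = log_likelihood_gradient(data, W)
    return [g - w / sigma2 for g, w in zip(grad, W)]
```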

Feature weight optimization

  • Learning can take a long time.

Feature interaction

  • Maxent models handle overlapping features well, but do not automatically model feature interactions.
  • In order to enable interaction between features, one needs to explicitly add features that model such interactions (e.g., f1, f2, and the conjunction f1 ∧ f2).
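For instance, an interaction feature can be built mechanically as the conjunction of two existing indicator features; the helper below and the two base features are purely illustrative, in the style of the earlier sketches.

```python
def conjoin(f1, f2):
    """Return an indicator feature that fires only when both f1 and f2 fire."""
    return lambda x, y: f1(x, y) * f2(x, y)

# e.g., "word is capitalized and y = NN" AND "previous tag is DT and y = NN"
def f_cap(x, y):
    return 1.0 if x.words[x.i][0].isupper() and y == "NN" else 0.0

def f_after_dt(x, y):
    return 1.0 if x.tm1 == "DT" and y == "NN" else 0.0

f_cap_and_after_dt = conjoin(f_cap, f_after_dt)
```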

Maximum entropy (ME) classifiers are widely used in a large variety of NLP applications.

7. Using conditional probability models in tagging applications

  • TRAIN(D) is a training algorithm that takes as input a data set (histories; outputs; feature vectors) and returns a parameter vector W. Many software packages implement this function.
  • PROB returns a probability distribution over possible outputs y given a history x and a parameter vector W: P(y | x, W) (see [Collins, 2003, Lecture 7] for information on how to compute this value).

Question: how do we implement a TAGGER(S, W) that tags an entire sequence S?

Solution: dynamic programming.

Log-Linear Taggers: Independence Assumptions

  • The input sentence S, with length n = S.length, has |𝒯|^n possible tag sequences.
  • Each tag sequence T has a conditional probability

P(T | S) = ∏_{j=1..n} P(T(j) | S, j, T(1) … T(j−1))        (chain rule)

         = ∏_{j=1..n} P(T(j) | S, j, T(j−2), T(j−1))        (independence assumptions)

  • We have a black-box PROB which can compute the P(T(j) | S, j, T(j−2), T(j−1)) terms.

  • How do we compute TAGGER(S, W), where

TAGGER(S, W) = argmax_{T ∈ 𝒯^n} P(T | S)

            = argmax_{T ∈ 𝒯^n} log P(T | S) ?

The Viterbi Algorithm

  • Define n = S.length, i.e., the length of the sentence
  • Define a dynamic programming table

π[i, t_{-2}, t_{-1}] = maximum log probability of a tag sequence ending in tags t_{-2}, t_{-1} at position i

  • Our goal is to calculate max_{t_{-2}, t_{-1} ∈ 𝒯} π[n, t_{-2}, t_{-1}]

The Viterbi Algorithm: Recursive Definitions

  • Base case:

π[0, *, *] = log 1 = 0

π[0, t_{-2}, t_{-1}] = log 0 = −∞ for all other t_{-2}, t_{-1}

  • Recursive case: for i = 1 … S.length, for all t_{-2}, t_{-1}:

π[i, t_{-2}, t_{-1}] = max_{t ∈ 𝒯 ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }

Backpointers allow us to recover the max probability sequence:

BP[i, t_{-2}, t_{-1}] = argmax_{t ∈ 𝒯 ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }
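Putting the recursions together, here is a compact sketch of TAGGER(S, W). PROB is treated as a black box passed in as a function; the calling convention assumed below (prob(S, i, tag at i−2, tag at i−1) returning a dictionary mapping each candidate tag to its probability, with W already folded into prob) is an assumption of this sketch rather than a fixed interface.

```python
import math

def tagger(S, prob, tagset):
    """Viterbi decoding for the trigram log-linear tagger described above."""
    n = len(S)
    STAR = "*"                                # boundary tag for positions <= 0

    def K(k):                                 # allowable tags at position k
        return [STAR] if k <= 0 else list(tagset)

    # pi[(i, u, v)] = max log prob of a sequence ending in tags u, v at position i
    pi = {(0, STAR, STAR): 0.0}
    bp = {}

    for i in range(1, n + 1):
        for u in K(i - 1):                    # tag at position i-1
            for v in K(i):                    # tag at position i
                best, arg = -math.inf, None
                for w in K(i - 2):            # tag at position i-2
                    prev = pi.get((i - 1, w, u), -math.inf)
                    p = prob(S, i, w, u).get(v, 0.0)
                    score = prev + (math.log(p) if p > 0.0 else -math.inf)
                    if score > best:
                        best, arg = score, w
                pi[(i, u, v)] = best
                bp[(i, u, v)] = arg

    # Pick the best final pair of tags, then follow backpointers.
    u, v = max(((a, b) for a in K(n - 1) for b in K(n)),
               key=lambda ab: pi[(n, ab[0], ab[1])])
    tags = {n: v, n - 1: u}
    for i in range(n - 2, 0, -1):
        tags[i] = bp[(i + 2, tags[i + 1], tags[i + 2])]
    return [tags[i] for i in range(1, n + 1)]
```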

The Viterbi Algorithm: Running Time

  • O(n|𝒯|³) time to calculate log P(t_{-1} | S, i, t, t_{-2}) for all i, t, t_{-2}, t_{-1}
  • n|𝒯|² entries in π to be filled in
  • O(|𝒯|) time to fill in one entry

(assuming O(1) time to look up log P(t_{-1} | S, i, t, t_{-2}))

  • ⇒ O(n|𝒯|³) time overall

8. Available software packages

  • Zhang Le:
  • Chris Manning and Dan Klein, Stanford University, Java based, several optimization algorithms:
  • Franz Och, gcc, 132 lines of C:
  • Jason Baldridge, University of Edinburgh, Java-based:
  • Rob Malouf, UCSD, several optimization algorithms: (No longer seems to be available).
  • Hugo WL, Perl:

References

  • Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
  • Michael Collins. 2003. Lecture Notes 6 and 7 for course 6.891, Machine Learning Approaches for Natural Language Processing, MIT.
  • Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation, University of Pennsylvania.