Transformation-Based Learning for English Grapheme-to-Phoneme Conversion

Yuanli Zhou

June 3, 2003

Introduction

This project aims to find the correct pronunciations of words written in English. It does so by applying transformational rules that start from the written string and produce the sequence of phonemes to be pronounced.

Corpus Materials

CMU Pronouncing Dictionary

Brown Corpus

For the training data, I used a small subset of the Brown corpus, and matched all those words from that subset which also appear in the CMU Pronouncing Dictionary. The total number of word types was 2860. Words with apostrophes (e.g. it's) were considered as one word; however, the apostrophes were discarded in the actual training.

Transformational Models (TMs)

Each transformational rule has four parts: a source S, a target T, a left context L, and a right context R, and is denoted S->T/L_R. When applied to a string W, the rule locates a sequence LSR in W and changes it to LTR.
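As a concrete illustration, here is a minimal Python sketch of applying one such rule to a sequence of symbols. The names Rule and apply_rule are my own for the illustration, not from the project's implementation, and for simplicity contexts are checked against the input sequence.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rule:
    source: str            # S
    target: Optional[str]  # T; None denotes deletion (S -> zero)
    left: str              # L, with '#' standing for a word boundary
    right: str             # R, with '#' standing for a word boundary

def apply_rule(rule: Rule, seq: List[str]) -> List[str]:
    """Replace every occurrence of L S R in seq by L T R."""
    padded = ['#'] + list(seq) + ['#']
    out = []
    for i in range(1, len(padded) - 1):
        if (padded[i] == rule.source
                and padded[i - 1] == rule.left
                and padded[i + 1] == rule.right):
            if rule.target is not None:
                out.append(rule.target)
            # if target is None, the symbol is simply deleted
        else:
            out.append(padded[i])
    return out

# The rule IY->AY/M_N (see the rule lists below) applied to the default
# mapping of "mine" (m->M, i->IY, n->N, e->EH):
print(apply_rule(Rule('IY', 'AY', 'M', 'N'), ['M', 'IY', 'N', 'EH']))
# ['M', 'AY', 'N', 'EH']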

Such rules are a natural way of representing English orthography. For instance, c is usually pronounced [s] before the vowels e and i, but [k] elsewhere. We would want to look at the right-context of c to determine the phoneme.

Transformational rules are strictly more powerful than n-gram models: we may consider any n-gram model to be making a single transformation to turn a grapheme into a phoneme, while in a transformational model, several rules may be applied in succession.

Handwritten transformations are a good way to capture a native speaker's intuitions. Most handwritten TMs rely on a number of auxiliary symbols: objects that are no longer graphemes but not yet phonemes, and which will in the end be transformed into phonemes. While it is possible to do this by purely computational means (e.g. by searching over spaces of multiple successive rules), it would be very expensive in time and space, so I have opted to ignore auxiliary symbols and search for one rule at a time.

Also, I have restricted left and right context simply to the one phonemic symbol preceding and following the source. See the discussion later for the weaknesses of this approach and possible improvements.

Proposing and Evaluating Rules: A First Algorithm

When transformation-based learning is used for part-of-speech tagging, there is a clear way to decide which rules to consider and how to rank them: choose those that correct at least one error, and rank them by the net decrease in errors. When comparing sequences of phonemes, it is not as obvious whether a rule corrects an error. We need a metric that compares two phonemic sequences and tells us how close they are.

For this, I have used the Levenshtein edit distance, which counts the number of additions, deletions, and substitutions needed to change one sequence into another. A more accurate metric would find likely alignments between the sequences and take into account the natural "distance" between phonemes: certainly, it is less drastic to mistake [s] for [z] than for [e].
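For reference, a standard dynamic-programming implementation of this distance over phoneme symbols (rather than characters) might look as follows. This is my own sketch, not the project's code.

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions at cost 1 each."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))                 # distances from a[:0] to b[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete a[i-1]
                          curr[j - 1] + 1,    # insert b[j-1]
                          prev[j - 1] + cost) # substitute (or match)
        prev = curr
    return prev[n]

# Default mapping of "mine" vs. its pronunciation: one substitution, one deletion.
print(edit_distance(['M', 'IY', 'N', 'EH'], ['M', 'AY', 'N']))  # 2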

There is still the problem of deciding which rules to consider in the first place. It is too expensive to try every possible rule: even with reasonable restrictions, such as only allowing vowel-to-vowel and consonant-to-consonant transformations, there are over a million possible rules.

I have chosen to deal with this problem by selecting rules randomly in each training iteration. By choosing 1000 vowel-to-vowel and 1000 consonant-to-consonant rules, the hope is that at least one of them will be useful. Out of those, the best rule is chosen, as long as it improves the sum of the edit distances over all the words in the training set by some set threshold.

Another problem arises because the edit distance metric favors deletions. Deletions are necessary -- for example, doubled consonants or vowels, as in commit or seed, correspond to a single phoneme -- but once a deletion is made, the lost symbol is irrecoverable. Because the correct sequence of phonemes tends to be shorter than the word's length in graphemes, the edit distance can improve (through similarity in length) even when a useful element is deleted.

To counter this tendency, I have restricted training so that no deletions are made in the initial 30 iterations. The hope is that, once the form of a word has stabilized, deletions will no longer cause unintended damage.

These restrictions taken together form my First Algorithm.

Summary of First Algorithm

Map grapheme sequences directly to preset phonemes (e.g. a->AA).

Iterate (until broken):
    Select a random sample R of admissible rules
        (but no deletions for the initial iterations).
    For each r in R, apply r to the current phoneme sequences P to get r(P).
    Score S(r) := sum of edit distances from r(P) to the true sequences.
    If min S(r) improves by a threshold on the previous iteration:
        Transform P by argmin S(r).
        Repeat the iteration.
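Below is a minimal Python sketch of this loop, reusing the Rule, apply_rule, and edit_distance helpers from the earlier sketches. The phoneme inventories, the random_rules sampler, and the handling of an iteration that finds no qualifying rule are illustrative assumptions, not the project's actual code.

import random

# Illustrative (partial) split of the CMU phoneme set into vowels and consonants.
VOWELS = ['AA', 'AE', 'AH', 'AO', 'AY', 'EH', 'ER', 'EY', 'IH', 'IY', 'OW', 'UW']
CONSONANTS = ['B', 'D', 'G', 'HH', 'K', 'L', 'M', 'N', 'NG', 'P', 'R', 'S', 'SH', 'T', 'Z']
SYMBOLS = VOWELS + CONSONANTS + ['#']

def random_rules(k, allow_deletions):
    """Sample k vowel and k consonant rules (substitutions or, when allowed, deletions)."""
    rules = []
    for _ in range(k):
        for pool in (VOWELS, CONSONANTS):
            source = random.choice(pool)
            target = None if (allow_deletions and random.random() < 0.5) else random.choice(pool)
            rules.append(Rule(source, target, random.choice(SYMBOLS), random.choice(SYMBOLS)))
    return rules

def train(current, correct, iterations=100, sample=1000, threshold=5, no_delete_until=30):
    """current: default phoneme sequences; correct: true sequences from the dictionary."""
    current = [list(c) for c in current]
    total = sum(edit_distance(c, t) for c, t in zip(current, correct))
    learned = []
    for it in range(iterations):
        candidates = random_rules(sample, allow_deletions=(it >= no_delete_until))
        scored = [(sum(edit_distance(apply_rule(r, c), t)
                       for c, t in zip(current, correct)), r)
                  for r in candidates]
        best_total, best_rule = min(scored, key=lambda s: s[0])
        if total - best_total < threshold:
            continue                     # no candidate clears the threshold this iteration
        current = [apply_rule(best_rule, c) for c in current]
        total = best_total
        learned.append(best_rule)
    return learned, current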

Results

First, a quick note about some preliminary results from a more primitive attempt at transformations, in which I started from the bare list of graphemes and considered transformations that turn the actual grapheme sequence into the correct phonemes. This was simply too slow, and it tended to give bad rules such as r->P/p_o, though each iteration (up to at least 50) does improve the average edit distance by about 0.007.

If the graphemes are first mapped to default phonemes, the approach is somewhat more successful. Due to the restrictions I imposed on admissible rules, the program still has not performed well: with 2860 words and 100 training iterations, the average edit distance is around 2.93. This does improve with further iterations, but the average improvement per iteration is about 0.003.

About a third of the time, an iteration yields no rule at all that improves the score by more than the threshold amount (a decrease of 5 in total edit distance over the 2860 words).

Here are 20 rules found during one run (excluding the initial conversion of all graphemes to preset phonemes). To the right of each rule I have included a possible example of a word to which the rule applies.

(The '#' symbol denotes a word boundary.)

IY->AY/M_N mine

IY->/D_R ??

L->/L_EH parallel

IY->AY/S_D side

EH->/T_R ??

AA->AE/T_N tan

EH->/G_# sage

P->/#_HH phone

EH->/UW_# due

AA->EY/M_T mate

EH->ER/D_R under

P->/P_EH appear

EH->/V_S saves

OW->AO/HH_R horoscope

OW->/IY_N nation

HH->/T_IY this

EH->/EH_P deep

AA->EY/R_T rate

IY->/AA_N pain

K->/AA_EH ??

Most of these rules make some sense. Certain ones, such as the deletion of EH in due (D-UW-EH), seem to apply widely over all English words of that type (e.g. glue, sue). Some would be better as intermediary rules, such as deleting HH in this (T-HH-IY-S), where we would want to end up with a three-phoneme sequence (DH-IH-S).

There are also some inexplicable rules such as K->/AA_EH. There is no conceivable word in which K is deleted between A and E. In fact, it is probably the case that E ought to be silent, so this is a case in which the tendency in using edit distance to shorten phoneme sequences does not give rise to a good rule.

Indeed, after running the algorithm for a while, a disproportionate number of the rules found are deletions, and about half of them are not any good. Here are the rules from another run, after about 80 rules had already been found:

OW->/W_R

IH->AH/T_D

EH->ER/HH_R

EH->IY/K_EH

AA->AE/R_N

EH->/N_R

IY->/S_OW

IY->IH/R_N

AA->/HH_P

EH->/Z_#

G->NG/N_#

R->/AA_R

R->/UW_EH

EH->/W_R

P->/#_HH

At this stage we still get some good rules like G->NG/N_#, but the deletions are quickly becoming senseless.

Weaknesses of the First Algorithm and Subsequent Revisions

The determination of phonemes from graphemes in English is very much a local affair: it usually suffices to look at a small window around each letter to decide how to pronounce it. But it is not so local that it can be captured by a window of merely three symbols centered at the source. The different qualities of a in rat and rate are determined not by anything in the r-a-t sequence, but by the silent marker e that comes two slots after a.

It would be good, of course, if we could examine all the rules in a large window, but this is too expensive. How can we improve the transformational model without proposing an impractical number of rules? As the experiments showed, even a 3-symbol window gives well over a million rules, which cannot all reasonably be tested.

One possible solution is to try to align the phonemic sequences. This would take advantage of the fact that most of the variation in pronunciation comes in the vowels: if we consider a phoneme sequence as interleaved blocks of consonants and vowels, we can align the blocks and treat each block separately. For example, consider the word:

classmates --(direct map)--> K L AE S S M AE T EH S --(blocks)--> [K L][AE][S S M][AE][T][EH][S]

classmates --(correct)-----> K L AE S M EY T S --(blocks)--> [K L][AE][S M][EY][T][S]

Aligning the blocks, we can see that for this word we would only need to consider the rules that changed AA to AE, S to zero, AA to EY, and EH to zero; all rules would have a well-defined context.
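A small sketch of the block segmentation, reusing the illustrative VOWELS list from the training-loop sketch above. Note that the two segmentations can have different numbers of blocks (here because the silent e yields an extra EH block), so an alignment step over the blocks would still be needed; the code below only shows the segmentation.

def cv_blocks(seq):
    """Split a phoneme sequence into maximal runs of consonants or runs of vowels."""
    blocks = []
    prev_is_vowel = None
    for sym in seq:
        is_vowel = sym in VOWELS
        if is_vowel == prev_is_vowel:
            blocks[-1].append(sym)
        else:
            blocks.append([sym])
        prev_is_vowel = is_vowel
    return blocks

direct  = ['K', 'L', 'AE', 'S', 'S', 'M', 'AE', 'T', 'EH', 'S']
correct = ['K', 'L', 'AE', 'S', 'M', 'EY', 'T', 'S']
print(cv_blocks(direct))   # [['K','L'], ['AE'], ['S','S','M'], ['AE'], ['T'], ['EH'], ['S']]
print(cv_blocks(correct))  # [['K','L'], ['AE'], ['S','M'], ['EY'], ['T'], ['S']]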

Another possibility is to use rules with contexts that are more general than specific symbols. In the most general case, we would allow for regular expressions as context. The idea is to allow categories, such as "followed by a consonant, then a vowel", without having to propose every single rule that explicitly states which consonant and which vowel.
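As a hedged illustration of what such category contexts could look like, the sketch below encodes a context class as a regular expression over space-separated phoneme symbols. The specific rule (AA becomes EY when followed by one consonant and then a vowel, cf. rat vs. rate) and the symbol inventories are assumptions made for the example, not rules the system learned.

import re

# Category contexts as regular expressions over space-separated phoneme symbols.
CONSONANT = r'(?:B|D|G|HH|K|L|M|N|NG|P|R|S|SH|T|Z)'
VOWEL = r'(?:AA|AE|AH|AO|AY|EH|ER|EY|IH|IY|OW|UW)'

# "AA becomes EY when followed by a consonant and then a vowel."
pattern = re.compile(rf'AA (?={CONSONANT} {VOWEL})')

seq = 'R AA T EH'                  # default mapping of "rate"
print(pattern.sub('EY ', seq))     # 'R EY T EH'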

A final possibility, related to alignment -- and this is one which I chose to try to implement -- is to compare sequences going left to right. For example, we look at the two sequences for classmates, and notice that the first place in which they differ is S (direct) vs. M (correct). This implies that, in such a context, either S should be transformed to M, or S should be deleted. We may therefore look through all the training words and propose rules in such a manner.

The only difference between this Second Algorithm and the First is in how rules are proposed: they are no longer chosen at random.

Proposing Rules Under Second Algorithm

for each word in the training set:
    for i from 0 to minlength(current sequence, correct sequence):
        if current[i] != correct[i]:
            propose: current[i] -> correct[i] / current[i-1] _ current[i+1]
            propose: current[i] -> / current[i-1] _ current[i+1]
            break out of the inner loop
repeat until a preset number of rules have been proposed
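A Python sketch of this proposal step, reusing the Rule class from the first sketch; as before, contexts come from the current sequence, with '#' marking word boundaries.

def propose_rules(current, correct):
    """For each word pair, propose two rules at the first left-to-right mismatch."""
    proposals = []
    for cur, cor in zip(current, correct):
        padded = ['#'] + list(cur) + ['#']          # cur[i] is padded[i + 1]
        for i in range(min(len(cur), len(cor))):
            if cur[i] != cor[i]:
                left, right = padded[i], padded[i + 2]
                proposals.append(Rule(cur[i], cor[i], left, right))  # substitution
                proposals.append(Rule(cur[i], None, left, right))    # deletion
                break
    return proposals

# For the classmates example above, the first mismatch (S vs. M) proposes
# S->M/AE_S and the deletion S->/AE_S.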

Results of Second Algorithm

After just 10 iterations and considering 500 rules per iteration, the second algorithm reaches an average distance of 2.90, which is better than having run 100 iterations of the first algorithm. After 40 iterations, the average distance is 2.73. After 100 iterations, it improves to 2.41.

Sample rules found by the second algorithm (along with the total edit distance after applying each one in succession) are:

N->NG/IH_G (8611)

G->/NG_# (8429)

EH->/T_R (8379)

EH->/V_R (8342)

S->/S_# (8312)

AE->/EH_R (8283)

EH->T/K_D (8259)

D->/T_# (8236)

K->/K_# (8213)

R->ER/T_# (8194)

S->SH/IH_HH (8178)

N->/N_EH (8165)

There is still a problem with odd vowel deletions, but in general the performance has improved. To build a better system, we will certainly need less restrictive rules.

So far, for both the first and second algorithms, every transformation rule has had one particular symbol on the left and one particular symbol on the right. We can generalize this by looking at all three-symbol windows containing the source. Thus, in each iteration, for each word, four more rules are proposed: in each window, the source may turn into the target or into zero.

A slight disadvantage of this extension is that it triples the number of rules proposed per word per iteration, so we must consider either fewer rules or fewer words in each iteration if the expected running time is to stay constant.
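The sketch below shows one way such windowed proposals could be generated. WindowRule and the tuple-valued contexts are my own illustrative generalization of the Rule class above, not the project's data structures.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WindowRule:
    source: str
    target: Optional[str]      # None denotes deletion
    left: Tuple[str, ...]      # zero, one, or two symbols of left context
    right: Tuple[str, ...]     # zero, one, or two symbols of right context

def propose_window_rules(cur, cor):
    """Propose six rules for the three 3-symbol windows containing the first mismatch."""
    proposals = []
    padded = ['#', '#'] + list(cur) + ['#', '#']
    for i in range(min(len(cur), len(cor))):
        if cur[i] != cor[i]:
            p = i + 2                                        # index of cur[i] in padded
            windows = [((), tuple(padded[p + 1:p + 3])),     # source at the left:  _X,Y
                       ((padded[p - 1],), (padded[p + 1],)), # source in the middle: X_Y
                       (tuple(padded[p - 2:p]), ())]         # source at the right: X,Y_
            for left, right in windows:
                proposals.append(WindowRule(cur[i], cor[i], left, right))
                proposals.append(WindowRule(cur[i], None, left, right))
            break
    return proposals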

Results of Allowing Several Windows

The second through fourth rules found on one run illustrate the power of such a system:

T->SH/_IH,AA (8637)

IH->AH/_AA,N (8522)

AA->/AH_N (8406)

This sequence of rules turns the tion grapheme sequence into SH-AH-N, such as in nation [N-EY-SH-AH-N].

After 10 iterations considering 600 rules per iteration, the system has error 2.88. After 50 iterations, the error decreases to 2.55. After 100 iterations, the error is 2.21.

A Last Tweak

The algorithm as presented so far always proposes rules starting from the top of the list of training words. Thus, if the number of rules tried per iteration is low, the algorithm will overfit to the top of that list. We remedy this by changing the starting place for each iteration.
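One simple way to vary the starting place is sketched below; the rotation scheme and the stride value are illustrative choices, not the report's exact method.

def rotated(words, iteration, stride=97):
    """Return the training list starting from a different position each iteration."""
    start = (iteration * stride) % len(words)
    return words[start:] + words[:start]

# Inside the training loop, rules would then be proposed from
# rotated(training_words, it) instead of always from the top of the list.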

When we do this, the learning rate does indeed go up. After 10 iterations, error is 2.76; after 20 iterations, it is 2.57.

Letting the algorithm run until convergence, the result (after 286 iterations) is 1.39 average errors per word.

Possible Further Extensions

A rather important feature of English pronunciation which this system does not attempt to deal with is stressed and unstressed syllables. Incorporating those considerations should improve the performance on identifying phonemes, since English has fairly regular rules for unstressed vowel reduction.