Hebrew vowel restoration using a bigram HMM model

Kobi Gal

Department of Engineering and Applied Sciences

Harvard University


CS 287r – Natural Language Processing

May 16th, 2001

Abstract

We present the problem of vowel restoration in Hebrew and attempt to solve it using a unigram baseline model and a bigram HMM model, with Katz's method for combining discounting and backoff. We achieve 68% word accuracy with our baseline and 80% word accuracy with the HMM model. Most of the errors are due to unseen unigrams and bigrams and to the large number of hidden states with low counts. We conclude that some degree of morphological analysis would greatly assist us in improving word accuracy.

1. Introduction

In modern Hebrew, written texts contain only consonants. Hebrew has a rich morphology, and because of the missing vowels considerable ambiguity exists already at the word level: many words with different vowel structures share the same consonants in a vowel-less setting. For example, the non-voweled word ספר, written in Latin transliteration as SPR[1], may represent the noun “book” (pronounced /sepher/[2]), the verb “to count” /saphar/ in 3rd person singular form, and at least 4 other interpretations[3]. This is further complicated by the fact that verb conjugations, connectives, and short particles in Hebrew are attached to the word as prefixes or suffixes, resulting in further ambiguity. For example, short particles such as the definite article ha- (=the), represented by the letter ה, the connective w (=and), represented by the letter ו, the subordinator $e (=of, which), represented by the letter ש, and many short prepositions are all written with no space before the following word. The result is that a string of letters may represent one word, a combination of a particle and a word, or even several particles and a word (Ornan et al., 95). For example, the word ברכות, written in Latin transliteration as BRXWT, may represent the Hebrew noun “congratulations” /brahot/, or a preposition attached to a noun, “with softness” /be-rakut/. The word היתר can represent the Hebrew noun “permit” /hey-ter/ or the definite article attached to a noun, “the rest” /ha-yeter/.

When faced with modern Hebrew script, a reader therefore confronts a double problem: he must not only recognize each word in order to vocalize it, but when the consonant string is ambiguous, he must also choose the right reading in context. This is hardly a simple task, for the number of ambiguous representations of words can be fairly large. Remarkably, in most cases native speakers of the language can accurately vocalize each word. They do so on the basis of their command of the grammar, their acquaintance with the lexicon, their knowledge of semantic connections, and their understanding of the real world.

In this paper, we propose a model that employs contextual information to restore the proper vowel structure of each word. Note that this is a harder problem than restoring the vocal form of the word, since many different vowel forms sound the same[4]. For applications such as automatic speech synthesizers, merely restoring the vocal pattern of the sentence would be sufficient. However, we wish to fully annotate the vowel pattern of each word, so that non-voweled text may be restored to its original form, benefiting foreign speakers, sufferers of dyslexia, and children learning the language.

2. Previous and related work

I have located a commercial product named "Nakdan Text" (Choueka, 89). Given a sentence in modern Hebrew, this program restores its vowels by first finding all possible morphological analyses and attached vocalizations of every word in the sentence. "Nakdan Text" then chooses for every such word the correct context-dependent vowel pattern using short-context syntactical rules as well as some probabilistic and statistical modules. The authors claim to achieve 95% “correctness” in restoring vowel patterns; it is not clear whether this refers to word accuracy or letter accuracy[5]. This program was demonstrated at BISFAI-95, the fifth Bar-Ilan international symposium on Artificial Intelligence, but no summary or article was included in its proceedings, and to the best of my knowledge no article describing this work has ever been published. My several attempts at contacting the authors have also been unsuccessful.

Yarowsky (Yarowsky, 92) has implemented several techniques for the restoration of vowel patterns in French and Spanish, namely Bayesian classification and decision lists, the latter producing the best results. Decision lists are a hybrid approach combining the strengths of both HMMs and Bayesian classifiers, taking into account both local and long-term dependencies between words as well as integrating different types of rules. We believe decision lists to be applicable to the Hebrew vowel restoration problem, and in future work we would like to compare their performance with our bigram HMM model.[6]

Levinger and Ornan (Levinger and Ornan, 91) have conducted research in the related field of morphological analysis of Hebrew. They attempt to decompose a word into its distinct features, such as part of speech, lexical entry, tense (verbs only), gender, number, and person, as well as the gender, number, and person of any pronoun suffixes. For example, the morphological analysis of the Hebrew string וכשבאתי, transliterated in Latin as WK$B^TY and pronounced /ve-kshe-bati/, is as follows (a possible machine-readable representation is sketched after the list) –

· lexical entry – the verb בא, transliterated in Latin as B^ (=to come)

· category – verb

· tense – past

· feminine/masculine, first person singular

· object pronoun – masculine, singular, 3rd person
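For illustration only, such an analysis could be represented as a simple record; the field names and encoding below are our own and do not reflect Levinger and Ornan's actual representation.

```python
# Illustrative only -- a hypothetical record for the analysis of WK$B^TY above;
# field names are ours, not Levinger and Ornan's.
analysis = {
    "lexical_entry": "B^",   # the verb בא (=to come)
    "category": "verb",
    "tense": "past",
    "person": 1,
    "number": "singular",
    "gender": "feminine/masculine",
    "object_pronoun": {"gender": "masculine", "number": "singular", "person": 3},
}
```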

Clearly, a full morphological analysis of Hebrew would also solve the vowel restoration problem, because it would disambiguate every word and we could simply look up the appropriate entry in the dictionary.

Segel (Segel, 97) devised a statistical Hebrew lexical analyzer that takes non-voweled Hebrew text as input and achieves 95% word accuracy on test data extracted from the Israeli newspaper Ha-aretz. However, this method requires fully analyzed Hebrew text to train on; the author used a hand-analyzed training set consisting of only 500 sentences. Because there is no Hebrew treebank of analyzed text, this method would not carry over well to other domains, such as novels or medical texts. To date, no statistical language model of Hebrew has used an N-gram model[7].

3. Our Approach

We have stated above that contextual information is important in resolving lexical ambiguities in Hebrew and is commonly used by native speakers. We believe that a bigram HMM would sufficiently capture the contextual dependencies between words. We initially considered a trigram approach but decided that the added cost in development time, brought about by the increase in the number of hidden states and the more complex smoothing techniques required, exceeded the scope of this project.

We will evaluate performance by checking the percentage of words in the test set that are assigned the correct vowel pattern. We refer to this performance measure as the word accuracy percentage. Because our primary purpose is to assist in the reading of modern Hebrew, we will not check the percentage of correctly annotated syllables/letters[8].
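As a concrete illustration, word accuracy can be computed by comparing each predicted vowel-annotated word with its reference form; the following minimal Python sketch (function and variable names are ours, not from the original implementation) shows the calculation.

```python
def word_accuracy(predicted, reference):
    """Fraction of words whose predicted vowel-annotated form exactly
    matches the reference (gold) form."""
    assert len(predicted) == len(reference)
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)

# Example: 4 of 5 words restored correctly -> word accuracy of 0.8
print(word_accuracy(["a", "b", "c", "d", "x"], ["a", "b", "c", "d", "e"]))
```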

4. Corpus – The Westminster Hebrew Morphological Database

There is an unfortunate lack of vowel-annotated text in modern Hebrew. An obvious substitute is the Hebrew Bible. Ancient Hebrew bears enough syntactic and semantic resemblance to modern Hebrew to warrant the use of a Bible corpus. The Westminster Hebrew Morphological Database is a corpus carrying a complete transcription of the graphical form of the Masoretic text of the Hebrew Bible and contains 301,267 words. The text is coded in Latin letters. Apart from the vowels, the corpus contains the following: open and closed paragraph marks, the accentuation signs, and morphological divisions for certain morphemes. The codes for these different details are interspersed with one another. Ninety percent of the corpus was used as training data and 10 percent as test data. Given the relatively small size of the training data, multifold testing and training on different segments of the data might improve the robustness of the results, but due to time constraints this was not done.
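For concreteness, the 90/10 split might be carried out as in the following sketch; the sentence list and its format are assumptions for illustration, not the actual Westminster encoding.

```python
def split_corpus(sentences, train_fraction=0.9):
    """Split a list of vowel-annotated sentences into training and test portions."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]

# Usage (hypothetical): 90% of the corpus for training, the remaining 10% for testing.
# train_sentences, test_sentences = split_corpus(all_sentences)
```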

5. Baseline – A unigram model

We built a frequency table with a separate entry for each vowel-annotated word in the training data, counting the number of times each vowel-annotated word appeared. In the testing phase, we look up all of the entries in the table that share the word's consonant structure and pick the vowel-annotated form with the highest count. The following chart describes the distribution of ambiguity in the training set.

Note that only 33% of the training set was unambiguous. We achieved an overall success rate of 68% using the baseline technique (see the results section).
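A minimal Python sketch of this unigram baseline follows; the data structures and the strip_vowels helper are illustrative assumptions, not the original implementation.

```python
from collections import Counter, defaultdict

def train_unigram(voweled_words, strip_vowels):
    """Count vowel-annotated words, indexed by their consonant-only form.
    `strip_vowels` (assumed helper) maps a vowel-annotated word to its
    consonant skeleton."""
    table = defaultdict(Counter)
    for w in voweled_words:
        table[strip_vowels(w)][w] += 1
    return table

def restore_word(consonant_word, table):
    """Pick the most frequent vowel-annotated form sharing this consonant structure."""
    candidates = table.get(consonant_word)
    if not candidates:
        return consonant_word  # unseen word: leave it unvoweled
    return candidates.most_common(1)[0][0]
```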

6. Bigram model

We constructed a bigram Hidden Markov Model in which the hidden states are vowel-annotated words and the visible states are vowel-less words. One example of a path in the HMM is given in chart 2. In our model, each hidden state has only one possible emission to an observation. Therefore the probability of a sentence w1,n is the sum, over all hidden-state sequences, of the probability that the HMM traversed those hidden states while generating w1,n, starting at a particular hidden state.

These probabilities are approximated by bigram counts (see chart 2). Note that two special symbols, shown as # #, were prepended to each sentence in the test set. They serve to “anchor” the initial state of the HMM and facilitate computation.
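For illustration, before any discounting (see section 7), each transition probability can be estimated from the training counts by the usual maximum-likelihood ratio:

\[ P(v_2 \mid v_1) \approx \frac{c(v_1, v_2)}{c(v_1)} \]

where v1 and v2 are vowel-annotated words and c(·) denotes counts in the training data.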

Chart 2 – HMM model for the non-vowel-annotated phrase “in the beginning God created…” בראשית ברא אלוהים, transliterated in Latin as BR^$YT - BR^ - ^LWHYM and pronounced /be-reshit bara elohim/

Therefore, the hidden states actually consist of vowel-annotated bigrams.

The probability of one possible path generating this phrase can be computed as follows:

\[ P(v_1\, v_2\, v_3) \approx P(v_1 \mid \#)\; P(v_2 \mid v_1)\; P(v_3 \mid v_2) \]

which decomposes the path probability into individual bigram probability estimates (here v1, v2, v3 are the vowel-annotated forms of the three words in the phrase and # is the prepended anchor symbol).

Note that in chart 2, there exist transitions to hidden states that include missing bigrams. In the model, each state has a possible transition to a missing-bigram state. We elaborate on this issue in section 7. We implemented the Viterbi algorithm to find the most likely path through the hidden states. More formally, we define the problem as finding

\[ \hat{v}_{1,n} = \operatorname*{argmax}_{v_{1,n}} P(v_{1,n} \mid w_{1,n}) \]

where v1,n ranges over sequences of vowel-annotated words and w1,n is the observed non-voweled sentence.

We achieve this by keeping track of the most likely path through the hidden states for each possible hidden state with a nonzero probability estimate and for each observation.
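A minimal sketch of this Viterbi search over vowel-annotated candidates is given below; the candidate lookup, the transition-probability function, and the use of log probabilities are illustrative assumptions rather than the original implementation.

```python
import math

def viterbi(observed, candidates_for, trans_prob):
    """Most likely sequence of vowel-annotated words for a non-voweled sentence.

    observed       : list of non-voweled words
    candidates_for : hypothetical lookup mapping a non-voweled word to its
                     possible vowel-annotated forms (assumed non-empty;
                     unseen words would get a special candidate state)
    trans_prob     : smoothed bigram probability P(v | prev), e.g. the Katz
                     estimate of section 7 (assumed nonzero)
    """
    # best[state] = (log probability, path) of the best path ending in `state`
    best = {"#": (0.0, [])}          # "#" is the anchor symbol prepended to sentences
    for w in observed:
        new_best = {}
        for v in candidates_for(w):
            # emission probability is 1, so only transitions contribute
            score, path = max(
                ((p + math.log(trans_prob(prev, v)), path)
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[v] = (score, path + [v])
        best = new_best
    # return the highest-scoring complete path
    return max(best.values(), key=lambda item: item[0])[1]
```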

7. Sparse data

Because our bigram model is trained on a finite corpus, many bigrams are likely to be missing from it. In the unigram model, we found that as many as 16 percent of the words in the test set could not be found in the count lookup table. The proportion of unseen bigrams was even higher, as much as 20 percent. This is not surprising, as we expect some unseen bigrams to consist of two words that were each seen before individually. We did not specifically deal with sparse data in the unigram baseline model. Since many of the unseen unigrams were unambiguous, we would have liked to look up the unseen words in a vowel-annotated Hebrew dictionary and copy the vowel pattern found there. However, as noted in section 2, Hebrew words carry prefixes and suffixes that represent conjugation, prepositions, pronouns, and connectives. The dictionary contains only the stem form of verbs and nouns, and without a morphological analyzer we cannot recover the stem.

Therefore we proceed as follows. We employ Katz's technique (Katz, 99) to combine a discounting method with a backoff method in order to obtain a good estimate of unseen bigrams. We use the Good-Turing discounting method (Gale & Sampson, 91) to tell us how much total probability mass to set aside for all the events we have not seen, and the backoff algorithm to tell us how to distribute this probability. Formally, we define

\[ P(w_2 \mid w_1) = \begin{cases} P_d(w_2 \mid w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1)\, p(w_2) & \text{otherwise} \end{cases} \]

where P_d is the discounted estimate using the Good-Turing method, p is a probability estimated from counts, and α(w_1) is a normalizing factor that divides the reserved probability mass among the unseen bigrams beginning with w1.

In order to compute P_d we create a separate discounting model for each context word w1. The reason for this is simple: if we used only one model over all of the bigram counts, we would really be approximating P_d(w2, w1). Because we wish to estimate P_d(w2|w1), we define the discounted frequency counts as follows –

\[ c^*(w_1, w_2) = (c + 1)\, \frac{n_{c+1}}{n_c} \]

where n_c is the number of different bigrams in the corpus that have frequency c. Following Katz, we estimate the probability of unseen bigrams to be

\[ p(w_2 \mid w_1) = \alpha(w_1)\, \frac{c(w_2)}{N} \]

where N is the total number of word tokens in the training data.

Note that in some cases w2 itself is unseen and c(w2) cannot be computed. To obtain an estimated probability for unseen unigrams, p(unseen|w1), we allocate some probability mass to unseen w2 words by keeping a special count for bigrams (see chart 2) that were seen fewer than k times.[9]
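To make the smoothing procedure concrete, the sketch below computes Good-Turing discounted counts and a simplified Katz-style backoff estimate for a single context word; the data structures and the handling of the leftover probability mass are illustrative assumptions, not the original code.

```python
from collections import Counter

def good_turing_discounted_counts(bigram_counts):
    """Good-Turing discounted counts c* = (c+1) * n_{c+1} / n_c for one
    context word w1. `bigram_counts` maps a following word w2 to the raw
    count c(w1, w2)."""
    freq_of_freq = Counter(bigram_counts.values())          # n_c
    discounted = {}
    for w2, c in bigram_counts.items():
        n_c, n_c_plus_1 = freq_of_freq[c], freq_of_freq.get(c + 1, 0)
        # fall back to the raw count when n_{c+1} is zero (common for large c)
        discounted[w2] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else float(c)
    return discounted

def katz_bigram_prob(w2, bigram_counts, unigram_counts, total_tokens):
    """Katz-style estimate of P(w2 | w1) for one fixed context word w1."""
    discounted = good_turing_discounted_counts(bigram_counts)
    context_total = sum(bigram_counts.values())
    if w2 in bigram_counts:
        return discounted[w2] / context_total
    # mass freed by discounting is redistributed to unseen bigrams in proportion
    # to the unigram estimate c(w2) / N; alpha is simplified here (a full
    # implementation normalizes it over all unseen w2)
    alpha = 1.0 - sum(discounted.values()) / context_total
    return alpha * unigram_counts.get(w2, 0) / total_tokens
```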

8. Results and discussion

As stated previously, the baseline unigram model achieved a word accuracy of 68%. Using the bigram HMM model, we managed to achieve a word accuracy of 80%. This is an improvement over our baseline, but the results are still unsatisfying, as they mean that, on average, our model misclassifies 2 words in every sentence. Most of the errors (16%) are attributed to the misclassification of unseen words. As stated in section 2, some degree of morphological analysis would help us classify many of these words more accurately. Another problem of the model is the large number of hidden states that correspond to vowel-annotated words seen in the training data. This phenomenon is attributed to the following aspects of the morphological richness of Hebrew: