A Corpus Based Morphological Analyzer for Unvocalized Modern Hebrew

Alon Itai and Erel Segal

Department of Computer Science

The Technion—Israel Institute of Technology

Haifa, Israel

Abstract

Most words in Modern Hebrew texts are morphologically ambiguous. Some words have up to 13 different analyses, and the average number is 2.4. The correct analysis of a word depends on its context. We describe a method for finding the correct morphological analysis of each word in a Modern Hebrew text. The program first uses a small tagged corpus to estimate the probability of each possible analysis of each word regardless of its context and chooses the most probable analysis. It then applies automatically learned rules to correct the analysis of each word according to its neighbors. Lastly, it uses a simple syntactical analyzer to further correct the analysis, thus combining statistical methods with rule-based syntactic analysis. It is shown that this combination greatly improves the accuracy of the morphological analysis—achieving up to 96.2% accuracy.

The software described in this article is available at Segal (2001).

1. Introduction

The morphological analysis of words is the first stage of most natural language applications. In Modern Hebrew (henceforth Hebrew), the problem is more difficult than in most languages. This is due to the rich morphology of the Hebrew language and the inadequacy of the common way in which Hebrew is written—the unvocalized script—which results in a great degree of morphological ambiguity. The average number of analyses per word is 2.4, and in extreme cases it can reach 13. The problem has not yet found a satisfactory solution (see Levinger (1995)).

In Hebrew, the morphological analysis partitions a word token into morphemes: one of them contains the lexical value, others contain linguistic markers for tense, person, etc., and still others represent short words, such as determiners, prepositions and conjunctions, prepended to the word token. The part-of-speech (POS) of the lexical value and the linguistic markers of a word define a tag. These tags correspond to POS tags in English.

POS tagging in English has been successfully attacked by corpus-based methods. Thus we hoped to adapt successful part-of-speech tagging methodologies to the morphological analysis of Modern Hebrew. There are two basic approaches: Markov Models (Church (1988), Brants (2000)) and acquired rule-based systems (Brill (1995)). Markov Model based POS tagging methods were not applicable, since such methods require a large tagged corpus for training, and such corpora do not yet exist. Moreover, the fairly free word order of Hebrew makes it more difficult to apply bigram-based methods. We preferred, therefore, to adapt Brill's rule-based method, which requires only a small training corpus. Brill's method starts by assigning to each word its most probable tag, and then applies a series of “transformation rules”. These rules are automatically acquired in advance from a modestly sized training corpus; see Section 4.

In this work, we find the correct morphological analysis by combining probabilistic methods with syntactic analysis. The solution consists of three consecutive stages:

1. The word stage: In this stage we find all possible morphological analyses of each word in the analyzed text. Then we approximate, for each possible analysis, the probability that it is the correct analysis, independently of the context of the word. For this purpose, we use a small analyzed training corpus. After approximating the probabilities, we assign each word the analysis with the highest approximated probability (this stage is based on Levinger (1995)).

2. The pair stage: In this stage we use transformation rules, which correct the analysis of a word according to its immediate neighbors. The transformation rules are learned automatically from a training corpus (this stage is based on Brill (1995)).

3. The sentence stage: In this stage we use a rudimentary syntactical analyzer to evaluate different alternatives for the analysis of whole sentences. We use a hill-climbing algorithm to find the analysis that best matches both the syntactic information obtained from the syntactical analysis and the probabilistic information obtained from the previous two stages.

The data for the first two stages is acquired automatically, while the sentence stage uses a manually created parser.
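As a rough illustration of how the stages fit together, consider the following sketch in Python (all names here are illustrative placeholders, not the identifiers of our implementation):

    # A minimal sketch of the three-stage pipeline (hypothetical names).
    def analyze_sentence(sentence, analyze, prob, rules, sentence_stage):
        # Word stage: choose, for each word, its a-priori most probable analysis.
        analyses = [max(analyze(word), key=prob) for word in sentence]
        # Pair stage: learned transformation rules correct each analysis
        # according to its immediate neighbors.
        for rule in rules:
            analyses = rule(sentence, analyses)
        # Sentence stage: hill-climbing guided by a rudimentary parser.
        return sentence_stage(sentence, analyses)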

Using all three stages results in a morphological analysis that is correct for about 96% of the word tokens. This result approaches results reported for English probabilistic part-of-speech tagging. It does so using a very small training corpus (only 4,900 words), similar in size to the corpus used by Brill and much smaller than the million-word corpora used for HMM-based POS tagging of English. The results show that combining probabilistic methods with syntactic information improves the accuracy of morphological analysis.

In addition to solving a practical problem of Modern Hebrew and other scripts that lack vocalization (such as Arabic and Farsi), we show how several learning methods can be combined to solve a problem that cannot be solved by any one of the methods alone.

a. Previous Work

Both academic and commercial systems have attempted to attack the problem posed by Hebrew morphology. The commercial system Rav Millim (MATAH) provided a morphological analyzer. Within a machine translation project, the IBM Haifa Scientific Center developed a morphological analyzer (Ben-Tur et al. 1992), which was later used in several commercial products. The sources of these systems are proprietary, so we used Segal's publicly available morphological analyzer (Segal 2000).

Other works attempt to find the correct analysis in context. Choueka and Lusignan (1985) proposed to consider the immediate context of a word and to take advantage of the observation that, quite often, if a pair of adjacent words appears more than once, the same analysis will be the correct one in all these contexts. Orly Albeck attempted to mimic the way humans analyze texts by manually constructing rules that make it possible to find the right analysis without backtracking. Levinger (1992) gathered statistics to find the probability of each analysis, and then used hand-crafted rules to rule out ungrammatical analyses.

2. Modern Hebrew Morphology

a. The problem

This section follows Levinger (1995).

Due to its extent, morphological ambiguity is a severe problem in Hebrew. Thus, finding methods to reduce the morphological ambiguity of the language is a great challenge for researchers in the field and for people who wish to develop natural language applications for Hebrew.

Number of Analyses    Number of Word-Tokens    %
1                     17551                    45.1
2                      9876                    25.4
3                      6401                    16.5
4                      2493                     7.1
5                      1309                     3.37
6                       760                     1.27
7                       337                     0.87
8                       134                     0.34
9                        10                     0.02
10                       18                     0.05
11                        1                     0.002
12                        3                     0.007
13                        5                     0.01

Table 1: The extent of morphological ambiguity in Hebrew

Table 1 demonstrates the extent of morphological ambiguity in Hebrew. The data was obtained by analyzing large texts, randomly chosen from the Hebrew press, consisting of nearly 40,000 word-tokens. According to this table, the average number of possible analyses per word-token was 2.1, while 55% of the word-tokens were morphologically ambiguous. The main reason for this amount of ambiguity is the standard writing system used in Hebrew (unvocalized script). In this writing system, not all the vowels are represented, several letters represent both a consonant and different vowels, and gemination is not represented at all. In Hebrew, a word token consists of several morphemes that represent lexical values; functional words such as prepositions and pronouns; and linguistic markers such as articles, gender, number and possessive cases. The morphemes undergo morphological transformations that further add to the ambiguity. To demonstrate the complexity of the problem, we should take a closer look at Hebrew words and morphology.

In order to overcome technical difficulties, we write Hebrew texts using a Latin transliteration, in which each Hebrew letter is represented by a single Latin letter or a single special symbol. This is a one-to-one transliteration, not a phonological transcription. See Appendix A for the details of this transliteration.

For example, the morphological analysis of the word token וכשראיתי, transliterated as WK$RAITI (pronounced kshera’iti), is as follows:

W + K$E + RAH + ANI

W   = the conjunctive “and”
K$E = “when”
RAH = the lexical entry (ראה, the verb 'to see', past tense)
ANI = first person singular pronoun

Thus, WK$RAITI should be translated into English as: 'and when I saw'.

In general, a morphological analysis of a Hebrew word should extract the following information (a record-like sketch of these fields, in code, follows the list):

  • Lexical entry.
  • Part of speech.
  • Attached particles (e.g., connectives, prepositions, determiners).
  • Gender and number.
  • Status -- a flag indicating whether a noun or adjective is in its construct or absolute form.
  • Person (for verbs and pronouns).
  • Tense (for verbs only).
  • Gender, number and person of pronoun suffixes.
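For concreteness, such an analysis can be pictured as a record with one field per item above (a sketch only; the field names are ours, not the schema of any actual analyzer):

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Analysis:
        # One morphological analysis of a Hebrew word token
        # (illustrative field names, not an analyzer's actual schema).
        lexical_entry: str                # e.g. RAH
        pos: str                          # part of speech
        particles: Tuple[str, ...] = ()   # attached connectives, prepositions, determiners
        gender: Optional[str] = None
        number: Optional[str] = None
        status: Optional[str] = None      # construct vs. absolute (nouns and adjectives)
        person: Optional[str] = None      # verbs and pronouns
        tense: Optional[str] = None       # verbs only
        suffix: Optional[str] = None      # gender, number and person of a pronoun suffix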

The above word token is unambiguous: it has exactly one morphological analysis. However, most Hebrew word tokens have more than one analysis. Consider, for example, the following word token – HQPH (הקפה), which has three possible analyses:

  • The definite article H + the noun QPH (ה-קפה– the coffee).
  • The noun HQPH (הקפה– encirclement).
  • The noun HQP + the feminine possessive suffix H (הקף-ה– her perimeter).

For another example, consider the word token '$QMTI', which can be analyzed both as a noun and as a verb:

  • The noun '$QMH' (sycamore) with a first person possessive suffix 'I' ($QMT-I = my sycamore; here we witness a morphological transformation by which the letter H at the end of a feminine noun becomes a T when a suffix is added).
  • The connective '$' (that) + the past-tense verb 'QM' (got up) + the first person singular suffix 'TI' ($-QMTI = that I got up).

Note that in a given context only one of the possible analyses is correct, and a native speaker easily identifies it. For example, in the sentence:

@IPSTI &L $QMTI

(I-climbed [on] my-sycamore)

the string $QMTI is analyzed as a noun ($QMT-I), and in the sentence:

AMRTI LW $QMTI HBWQR MWQDM

(I-said to-him that-I-got-up this-morning early)

it is analyzed as a verb ($-QMTI).

b. The basic morphological analyzer

A basic morphological analyzer is a function that receives a word token and returns the set of all its possible morphological analyses. Several basic morphological analyzers for Hebrew have been developed over the last decade. We used a basic morphological analyzer that is in the public domain[1] and supplies all the morphological information described in the previous section except the object suffix for verbs, which is rare in Hebrew (the object suffix appeared only twice in a 4,900-word corpus of a Hebrew newspaper).
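In programmatic terms, a basic analyzer is simply a function from a word token to the set of its candidate analyses. A toy version, with the examples of the previous section hard-coded, might look as follows (illustrative only; the public-domain analyzer we used has its own interface):

    # Toy basic analyzer: word token -> set of candidate analyses.
    # Analyses are abbreviated to segmentation strings for readability.
    TOY_LEXICON = {
        "HQPH": {"H-QPH (the coffee)",
                 "HQPH (encirclement)",
                 "HQP-H (her perimeter)"},
        "$QMTI": {"$QMT-I (my sycamore)",
                  "$-QMTI (that I got up)"},
    }

    def analyze(token):
        # Return all possible morphological analyses of the token.
        return TOY_LEXICON.get(token, set())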

3. The word stage

a. The mathematical model

The following description follows Charniak et al. (1993).

Our purpose in this project is to find the most probable morphological analysis for a given sentence. In order to give a more accurate meaning to the notion of "most probable analysis", let us view each word in a sentence as a random variable, whose domain is the set of all Hebrew word tokens. A sentence is a series of such random variables:

W[1..n]=W[1]W[2]W[3]...W[n]

Similarly, a morphological analysis of a word is a random variable, whose domain is the set of all its morphological analyses. A morphological analysis of a sentence is a series of such random variables:

T[1..n]=T[1]T[2]T[3]...T[n]

T[i] represents the morphological analysis of word i.

As usual, we represent random variables by capital letters and a specific value of a variable by a lowercase letter. Thus, the input to our problem is a series of values of the random variables representing the sentence, namely w[1..n]. The output is a series of values of the random variables representing the morphological analyses, namely t[1..n].

We look for the most probable morphological analysis of the given sentence, which is:

t*[1..n] = argmax_{t[1..n]} P(T[1..n] = t[1..n] | W[1..n] = w[1..n])

(where the maximization is done over all possible morphological analyses t[1..n] of the sentence).

This is a very general formula: it assumes that the analysis of each word can depend on all other words in the sentence. Therefore, it is not possible in practice to use this formula directly. Some further assumptions are needed in order to make the calculations practical.

In the word stage, we make the assumption that the analysis of each word depends only on the word itself and not on the other words in the sentence. Using this assumption, Charniak et al. (1993) show that the formula above becomes:

t*[1..n] = argmax_{t[1..n]} ∏_{i=1..n} P(T[i] = t[i] | W[i] = w[i])

This means that for each word w[i] we should choose the most probable analysis for w[i]:

t*[i] = argmax_t P(T[i] = t | W[i] = w[i])

By Bayes' formula, this means that for each word w:

argmax_t P(t | w) = argmax_t P(t, w) / P(w) = argmax_t P(t, w)

We dropped the denominator P(w) because it is the same for all analyses t.

Except for a small number of exceptions[2], for each analysis t there is exactly one word token w for which it can be the correct analysis. For example, if the analysis t is "HLK, past-tense verb, third person singular feminine" (= she went), the word w must be "HLKH".

Therefore,

P(t, w) = P(t)

Hence, for each w:

argmax_t P(t | w) = argmax_t P(t)

To use this formula, we can calculate P(t) for each possible analysis t of each word w in the text and choose the most probable one (a code sketch of this choice follows the list below). For example, for the word “$QMTI” we should find the probabilities:

  • P(“$-QMTI”) – for the analysis as a verb.
  • P(“$QMT-I”) – for the analysis as a noun.
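The word stage then amounts to a lookup and a maximization (a minimal sketch; `analyze` returns the candidate analyses of a token, and `P` is a dictionary of estimated probabilities as described in the next subsection):

    # Word stage: choose the context-free most probable analysis of a token.
    def word_stage(token, analyze, P):
        candidates = analyze(token)
        return max(candidates, key=lambda t: P.get(t, 0.0))

    # For example, with P = {"$QMT-I": 0.0002, "$-QMTI": 0.0009},
    # word_stage would select the verb reading "$-QMTI".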

b. Acquisition of the probabilities

We cannot calculate the required probabilities theoretically, because we do not have a probabilistic model of the Hebrew language. However, we can calculate an empirical probability, which estimates the true probability. We estimate these probabilities by counting:

P(t) ≈ C(t) / N,

where C(t) is the number of times the word w was seen in the training corpus with the analysis t, and N is the number of words in the training corpus.

This formula assigns zero probability to analyses that do not appear in the training corpus, although their true probability may be positive. This is a problem in many statistical works, and it was a severe problem here because of the small size of the training corpus (only 4,900 words); for comparison, works on HMM-based methods for English part-of-speech tagging use million-word corpora. Therefore, many analyses did not appear in the training corpus, although their probability is larger than 0. However, even if we had a larger corpus, many words would be absent. This is due to the large number of conjugations in which each word can appear: in Hebrew, each noun has 24 forms and each verb has over 30, and added particles greatly increase the number of word tokens (in English, each noun has two forms (singular and plural) and each verb at most four). Thus, the number of Hebrew word tokens is much larger than in English. This raises a sparseness problem: the probability of encountering a yet unseen word token is non-negligible. English corpus-based methods treat each conjugation as a separate word; due to the sparseness, such an approach is impractical for morphologically rich languages such as Hebrew.
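The counting estimate itself is straightforward to implement (a minimal sketch, assuming the training corpus is given as (word, analysis) pairs):

    from collections import Counter

    def estimate_probabilities(tagged_corpus):
        # tagged_corpus: iterable of (word, analysis) pairs.
        # Returns the empirical estimate P(t) = C(t) / N.
        counts = Counter(analysis for _, analysis in tagged_corpus)
        N = sum(counts.values())
        return {t: c / N for t, c in counts.items()}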

Manning and Schütze (1999, Chapter 6) describe many approaches to the sparse data problem. We addressed this problem in two complementary ways. First, we gathered information about every morpheme of an analysis t separately. For example, we consider the lexical value (t_lv) of t and its other morphemes (t_mf), such that:

t = (t_lv, t_mf).

Assuming that the lexical value is statistically independent of the other morphemes, we can write:

P(t) = P(t_lv) · P(t_mf)

The statistical independence assumption is, of course, not always true; it is used only to obtain an estimate.

Returning to the above example, we get:

P($QMT-I) = P($QMH) · P(noun-with-first-person-singular-possessive-suffix)

P($-QMTI) = P(QM) · P($+verb-past-first-person-singular)

The above probabilities can also be estimated by counting:

P($QMH) = C($QMH)/N, P(QM) = C(QM)/N, ...

Obviously, there is much more information about lexical values alone and morphemes alone than there is about whole analyses. However, with as small a corpus as ours, there were still many lexical values and morphemes with zero counts.
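Concretely, the factored estimate multiplies two separately counted relative frequencies (a sketch, assuming a `split` function that maps an analysis to its lexical value and morpheme pattern; note that unseen lexical values or morphemes still receive zero counts, which motivates the smoothing described next):

    from collections import Counter

    def factored_estimator(tagged_corpus, split):
        # split(analysis) -> (lexical_value, morpheme_pattern).
        # Returns a function estimating P(t) = P(t_lv) * P(t_mf).
        lv_counts, mf_counts = Counter(), Counter()
        N = 0
        for _, analysis in tagged_corpus:
            lv, mf = split(analysis)
            lv_counts[lv] += 1
            mf_counts[mf] += 1
            N += 1
        def P(analysis):
            lv, mf = split(analysis)
            return (lv_counts[lv] / N) * (mf_counts[mf] / N)
        return P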

To overcome this problem, we estimated the probabilities of both the lexical values and the morphemes using the Good-Turing formula (first described by Good (1953)).

To get an intuitive understanding of the Good-Turing formula, suppose we divide our tagged corpus, which has N words, into two parts: a training part with N−1 words and a test part with a single word. We then use this partition to estimate the probability that a lexical value (or a set of morphemes) that was unseen in the training part will be seen in the test part. It is easy to see that this probability will be 1 if the test part happens to contain a word whose lexical value appears in the entire tagged corpus exactly once, and 0 otherwise.

We can repeat this experiment N times, each time selecting another word for the test part, and average the results of all N experiments. We will get:

P_0 = N_1 / N,

where N_1 is the number of lexical values that occur exactly once in the tagged corpus, and P_0 is the probability that a new word will have a lexical value that was unseen in the tagged corpus.

In a similar manner it may be argued that the probability that a new word will have a lexical value that was seen r times in the tagged corpus can be estimated by:

P_r = (r+1) · N_{r+1} / N,

where N_r is the number of lexical values that appeared r times in the tagged corpus.

The theoretical Good-Turing formula is more complicated: theoretically, we should use the expectations of the N_r's and not the observed N_r's themselves:

P_r = (r+1) · E[N_{r+1}] / N.
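The simple counting version of this estimate can be implemented directly (a sketch; a practical implementation would smooth the N_r counts, in the spirit of the remark about expectations above):

    from collections import Counter

    def good_turing(counts):
        # counts: Counter mapping each lexical value (or morpheme pattern)
        # to its frequency in the tagged corpus.
        # Returns P_r = (r+1) * N_{r+1} / N for r = 0 and each observed r:
        # the probability that the next word carries an item seen r times.
        N = sum(counts.values())
        N_r = Counter(counts.values())   # N_r: number of items seen exactly r times
        P = {0: N_r[1] / N}              # unseen mass: P_0 = N_1 / N
        for r in N_r:
            P[r] = (r + 1) * N_r[r + 1] / N
        return P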