Using Example-based Machine Translation for

English – Vietnamese Translation

Nguyen Minh Quang, Tran Dang Hung

Software Engineering Department, Faculty of Information Technology

Hanoi National University of Education

{quangnm, hungtd}@hnue.edu.vn

2

Abstract

Recently, there is a significant amount of advantages in Machine Translation in Vietnam. Most approaches are based on the combination between grammar analyzing and a statistic-based method or a rule-based method. However, their results are still far from the human expectation.

In this paper, we introduce a new approach which uses the example-based machine translation approach. The idea of this method is that using an aligned pair of sentences which is in Vietnamese and English and an algorithm to retrieve the most similar English sentence to the input sentence from the data resource. Then, we make a translation from the sentence retrieved.

We applied the method to English-Vietnamese translation using bilingual corpus including 6000 sentence pairs. The system approaches feasible translation ability and also achieves a good performance. Compare to other methods applied in English-Vietnamese translation, our method can get a higher translation quality.

I.  Introduction

Machine translation has been studied and developed for many decades. For Vietnamese, there are some projects which proposed several approaches. Most approaches used a system based on analyzing and reflecting grammar structure (e.g. rule-based and copora-based approaches). Among them, the rule-based approach is a trend of direction on this field nowadays; with bilingual corpus and grammatical rules built carefully [7].

One of the biggest difficulties in rule-based translation as well as other methods is data resources. An important resource that is required for translation is the thesaurus which needs lots of effort and work to build [9]. This dataset, however, do not meet the human’s requirements yet. In addition, almost traditional methods also require knowledge about languages applied so it takes time to built a system for new languages [5, 6].

The Example Based Machine Translation (EBMT) is a new method, which relies on large corpora and tries somewhat to reject traditional linguistic notions [5]. EBMT systems are attractive in that output translations should more sensitive to contexts than rule-based systems, i.e. of higher quality in appropriateness and idiomaticity. Moreover, it requires a minimum of prior knowledge beyond the corpora which makes the example set, and are therefore quickly adapted to many language pairs [5].

EBMT is applied successfully in Japanese and American in some specific fields [1]. In Japanese, they built a system achieving a high-quality translation and also an efficient processing in Travel Expression [1]. In Vietnamese, however, there’s no research following this method although the fact is that to apply in English-Vietnamese translation, this method doesn’t require too many resources and linguistic knowledge. We only have a English-Vietnamese Corpus dataset in Ho Chi Minh National University – the significant data resource with 40.000 pair of sentences (in Vietnamese and English) and about 5.500.000 words [8]. We already have the English thesaurus and English-Vietnamese dictionary. About the set of aligned corpora, we have made 5.500 items for the research.

In this paper, we use EBMT knowledge to build a system for English-Vietnamese translation. We will apply graph based method [1] to Vietnamese language. In this kind of paradigm, we have a set, each item in this set is a pair of two sentences: one in the source language and one in the target language. From an input sentence, we carry out from the set a item which is the most similar sentence to the input. Finally, from the example and the input sentence, we adjust to provide a final sentence in target language. Unfortunately, we don’t have a Vietnamese, thesaurus so we proposed some solutions for this problem. In addition, this paper proposes a method to adapt the example sentence to provide the final translation.

1. EBMT overview:

There are 3 components in a conventional example based system:

- Matching Fragment Component.

- Word Alignment.

- Combination between the input and the example sentence carried out to provide the final target sentence.

For example:

(1) He buy a book on international politics

(2) a. He buys a notebook.

b. Anh ấy mua một quyển sách.

(3) Anh ấy mua một quyển sách về chính trị thế giới.

With the input sentence (1), the translation (3) can be provided by matching and adapting from (2a, b).

One of the main advantages of this method is that, we can improve the quality of translation easily by widen the amount example set. The more items add, the better we have. It’s useful to apply for a specific field because the limit of form of the sentence included in these fields. For example, we use it to translate manuals of product, or weather forecast, or medical diagnosis.

The difficulty to apply EBMT in Vietnam is that, there’s no word-net in Vietnamese, so we promote some new solutions to this problem.

We build a system with 3 steps:

- Form the set of example sentences, the result is the set of graphs.

- Carry out the most popular example sentence to the input sentence. From an input sentence, using “edit distance” measuring, the system will find sentences which is the most similar to it. Edit-distance is used for fast approximate between sentences, the smaller distance, the greater similarity between sentences.

- Adjust the gap between the example and the input.

2.  Data resource:

We use 3 resources of data. That is:

-  Bilingual corpus: this is the set of example sentences. This set includes pairs of sentences. Each sentence is performed as a word sequence. Spreading the size of the set will improve the quality of translation.

-  The Thesaurus: A list of words showing similarities, differences, dependencies, and other relationships to each other.

- Bilingual Dictionary: We used the popular English Vietnamese dictionary file provided by Socbay company.

3.  Build the graph of example set.

The sentences are word sequences. We divide the words into 2 groups

- Functional word: Functional words (or grammatical words or auto-semantic words) are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker.

- Content word: Words that are not function words are called content words (or open class words or lexical words): these include nouns, verbs, adjectives, and most adverbs, although some adverbs are function words (e.g., then and why).

We classify the set into sub set. Each set includes sentences with the equal amount of content words and functional words.

Based on the division, we build a group of graphs – word graphs:

- They are directed graphs including start node and goal node. They includes nodes and edges, an edge is labeled with a word. In addition, each edge has its own source node and destination node.

- Each graph performs a sub set. Each sub set includes sentences with the same total of content word and the same total of functional word.

- Each path from start node to goal node performs a candidate sentence. To optimize the system, we have to minimize the size of word graph. Common word sequences in different sentences use the same edge.

Figure 1: Example of Word Graph

The word graphs have to be optimized with the minimum number of node. We use the method of converting finite state automata [3, 4].

After preparing all resources for this method, we will execute 2 steps of it: example retrieval and adaption:

4.  Example retrieval:

We use the A*Search algorithm to approach the most similar sentences from word graph. The result of matching between two word sequences is a sequence of substitution, deletions and insertions. The search process in a word graph is to find a least distance between the input sentence and all the candidates perform in graph.

As a result, matching sequences of path are approached as records which include a label and one or two words.

Exact match: E(word)

Substitution: S(word, word)

Deletion: D(word)

Insertion: I(word)

For example: Matching sequence between the input sentence We close the door and the example She closes the teary eyes is:

S(“She”, “We”) – E(“close”) – E(“the”) – I(“teary”) – S(“eyes”).

The problem here is that we have to pick a sentence with the least distance to the input sentence. We firstly compare the total of E-records in each matching sequence, then we compare S-records and so on.

5.  Adaption:

From the example approached, we adapt it to provide the final sentence in target language for input sentence by insertions, substitutions and deletions. To find the meaning of English words, we used morphological method.

5.1. Substitution, deletion and exaction:

We will find the right position for the word in the final sentence for substitution, deletion and exaction. With deletion, we do nothing, but the problem here is that we have to find to meaning of word in substitution and deletion records.

- There are some different meanings of a word, which one will be chosen?

- Words in the dictionary are all in infinitive form while they can change to many other forms in the input sentence.

We help to solve this problem carefully. Firstly, we find the type of word (noun, verb, adverb, …) in the sentence. We use Penn Tree Bank tagging system to specify the form of each word. Secondly, based on the form of word, we seek the word in the dictionary:

If the word is plural (NNS):

- If it ends with “CHES”, we try to delete “ES” and “CHES”, when the deletion makes an infinitive verb; we find the meaning in dictionary. Other case, it is specific noun.

- If it ends with “XES” or “OES”, we delete “XES” or “OES” and find the meaning.

- If it ends with “IES”: replace “IES” by “Y”.

- If it ends with “VES”: replace “VES” by “F” or “FE”.

- If it ends with “ES”: replace “ES” by “S”.

- If it ends with “S”: delete “S”.

After finding the meaning of plural, we add “những” before its meaning.

If the word is gerund:

- Delete “ING” at the end of the word. We try two cases. First is the word without “ING” and second is the word without “ING” and with “IE” at the end.

If the word is VBP:

- If the word is “IS”: it’s “TO BE”. If it ends with “IES”: replace “IES” by “Y”

- If it ends with “SSES”: erase “ES”

- If it ends with “S”: erase “S”

If the word is in the past participle or past form:

- Check the word if it’s included in the list of irregular verb or not. If it’s included, we use the infinitive form to find the meaning. The list of irregular verb is performing as red-black tree to make the search easier and faster.

- If it ends with “IED”: erase “IED”.

- If it ends with “ED”: check the very last 2 letter before “ED”, if they are identical then we erase 3 last letter of word. Other wise, we erase “ED”.

If the word is in present continuous form, we find the word in the same way with gerunds. After that we add “đang” after the meaning.

- If the word is JJS: Delete 3 and 4 last 4 consonant and find the meaning in the dictionary.

After infinitive form of word is found, we use bilingual dictionary to seek the meaning.

The problem is that, when we reach the infinitive form of word, since there are many meanings with a kind of words, we have to choose the right one. In our experiment, we take the first meaning in the bilingual dictionary.

5.2. Insertion:

The problem here is that we don’t know the exact position to fill the Vietnamese meaning. If we choose the position as the position of

Insertion record in matching sequence, the final sentence in Vietnamese will be in low quality. We have to use the theory of ruled-based machine translation to solve this problem. We can use it in some specific phrase to find the better position instead of the order of records.

Firstly, link grammar system will parse the grammatical structure of sentence. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

+-----O-----+

+-D-+--S--+ +--D--+

| | | | |

The boy painted a picture

From the grammatical structure of sentence, we find out some phrases in English which need to change the order of word to translate into Vietnamese. For example, the noun phrase “nice book”, with 2 I-records: I(nice) and I(book), we used to translate into “hay quyển sách” instead of “quyển sách hay”. With link grammar, we know the exact order to translate. Some phrases to process:

2

Table 1: Some phrase to process with Link Grammar

1 / Noun phrase: POS(1, 2) = ({JJ}, {NN}). Reorder: ({NN}, {JJ})
2 / Noun phrase: POS(1, 2, 3) = ({DT}, {JJ}, {NN}) & word1 = this, that, these, those. Reorder: ({NN}, {JJ}, {DT})
3 / Noun phrase: POS(1, 2) = ({NN1}, {NN2}). Reorder: ({NN2}, {NN1})
4 / Noun phrase: POS(1, 2) = ({PRP$}, {NN}). Reorder: ({NN}, {PRP$})
5 / Noun phrase: POS(1, 2, 3) = ({JJ1}, {JJ2}, {NN}). Reorder: ({NN}, {JJ2}, {JJ1})

2