Estonian-specific enhancements which could be used in statistical and hybrid MT systems. R&D report.

Draft Dec 15th 2014

Project name: Linguistic Knowledge in Estonian Machine Translation

Project acronym: EKT63

1. Introduction

We describe our work done in 2014 (morphology toolkit integration, bilingual dictionaries, etc.) and list ideas for subsequent years.

2. State of the art

Recent advances in MT

Since the beginning of this century, statistical machine translation (SMT) has become the dominant approach in machine translation. While at first it was applied mostly to widely spoken languages, e.g., English, Spanish or French, with the growth of language resources it has become more popular for smaller languages, including the languages of the Baltic countries: Estonian, Latvian and Lithuanian (e.g. Fishel et al. 2007, Skadiņa 2008). Recently SMT has been accepted as a practical approach by the European Commission, where MT@EC[1] is being implemented with the aim of providing easy access to European public services information across Europe in the user’s mother tongue.

The most popular approach used in SMT systems is phrase-based SMT (Koehn et al. 2003), where sequences of words (called phrases) are extracted automatically from parallel texts. This approach has shown rather good results for languages with simple morphology, while for languages with rich morphology and rather free word order it has shown several important deficiencies that in many cases make the output of a general SMT system incomprehensible even for gisting purposes. Manual analysis of English-Latvian SMT output (Skadiņa et al. 2012) has revealed the main problems of phrase-based SMT when applied to a morphologically rich language: almost half of the sentences contain incorrect word forms, and in many cases the word order is incorrect.

As a result, a lot of research has been done on incorporating linguistic knowledge into SMT. The most popular approach is factored models (Koehn and Hoang 2007), which allow different morphosyntactic properties to be incorporated. Application of factored models to translation into the Baltic languages (Latvian and Lithuanian) has shown some improvement in translation quality (e.g. Skadiņš et al. 2010). Similarly, Clifton and Sarkar (2011) achieve the best published result on the English-Finnish task using an unsupervised morphological segmenter with a supervised post-processing merge step.

Another proposal is to apply syntax-based models. The Moses framework (Koehn et al. 2007) supports tree-to-string, string-to-tree and tree-to-tree models. One limitation of this approach is the availability of a parser. Initial experiments with tree-to-string models in the Moses framework for English-Latvian have shown no improvement in translation quality. This could be explained by the lack of a Latvian phrase structure parser. Several other frameworks, including Joshua (Weese et al. 2011), Jane (Freitag et al. 2014) and cdec (Dyer et al. 2010), support syntax-based models as well.

Besides linguistically based tree models, Chiang (2007) has proposed hierarchical models in which trees are obtained automatically from parallel texts. This approach has shown good results in terms of the BLEU score for the English-Lithuanian, French-Lithuanian and Russian-Lithuanian language pairs. However, training and decoding become significantly slower.

A comprehensive overview of hybrid approaches to machine translation is provided by Thurmair (2009). These include work on pre- and post-processing (Stymne 2011). Syntactic preordering (Xu et al. 2009, Isozaki et al. 2010) aims to rearrange source language sentences into an order closer to that of the target language. It is useful for language pairs with different syntactic structures and word order. Post-processing can be used to correct MT output and improve translation quality (Stymne 2011, Mareček et al. 2011).

Domain adaptation and the specific treatment of terminology, multiword units and named entities have also been researched and have shown improvements in translation quality.

Recently, research has turned to deep neural networks (e.g. Le et al. 2012; Auli et al. 2013).

Estonian specific approaches

  • [Fishel et al 2007] Estonian-English Statistical Machine Translation: the First Results. Main problems found: wrong order of phrases and sparse data.
  • [Kirik 2008] Unsupervised Morphology in Statistical Machine Translation. Bachelor thesis. Using factored word alignment usually improves translation quality, but using factored word reordering lowers it. The best factor schema is created by separating the word into a stem factor and its rightmost suffix, and using the stem factor for word alignment training.
  • [Fishel et al 2010] Linguistically Motivated Unsupervised Segmentation for Machine Translation. (link)
  • [Kirik 2010] Language model based improvements in statistical machine translation. Master thesis. Factored MT: an additional (secondary) LM using POS tags, word frequencies in training corpora and word forms. No strategies were found that improve translation quality.
  • Any major paper not listed?

3. Integration of morphology knowledge into the Estonian-English-Estonian MT system

Best practice of using morphology tools in MT

The phrase-based approach in SMT allows source words to be translated differently depending on their context by translating whole phrases, whereas the target language model allows target phrases to be matched at their boundaries. However, most phrases in inflectionally rich languages can be inflected for case, number, tense, mood and other morphosyntactic properties, producing a considerable amount of variation.

Estonian belongs to the class of inflected languages, which are complex from the point of view of morphology. There are over 2000 different morphological tags for Estonian. With Estonian as the target language for SMT, the high inflectional variation of the target language increases data sparseness at the boundaries of translated phrases, where a language model over surface forms might be inadequate to estimate the probability of the target sentence reliably. Following the approach of English-Czech factored SMT (Bojar et al., 2009; Tamchyna & Bojar, 2013) and English-Latvian/Lithuanian factored SMT (Skadiņš et al., 2010), the most promising method of incorporating linguistic knowledge in SMT is to use morphology in factored SMT models (Koehn & Hoang, 2007). We have improved word alignment by calculating it over lemmas instead of surface forms, and we have introduced an additional language model over disambiguated morphological part-of-speech tags in the English-Estonian system. An additional language model over morphosyntactic part-of-speech tags can be built in order to improve inter-phrase consistency (Skadiņš et al., 2010; 2014). The tags contain morphological properties generated by a statistical part-of-speech tagger. The order of the tag LM was increased to 7, as the tag data has a significantly smaller vocabulary.
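
The two uses of morphology described above can be illustrated with a minimal Python sketch. It assumes that the training sentences are already available in the factored `surface|lemma|tag` form; the function names (`split_factors`, `ngram_counts`) are illustrative and are not taken from the project code. The lemma stream is what would be given to the word aligner instead of surface forms, and the tag stream is what the additional tag language model would be trained on. Only n-gram counting is shown here, not an actual LM toolkit.

```python
from collections import Counter

TAG_LM_ORDER = 7  # order of the tag language model mentioned in the text


def split_factors(factored_sentence):
    """Split 'surface|lemma|tag' tokens into three parallel streams."""
    surfaces, lemmas, tags = [], [], []
    for token in factored_sentence.split():
        surface, lemma, tag = token.rsplit("|", 2)
        surfaces.append(surface)
        lemmas.append(lemma)
        tags.append(tag)
    return surfaces, lemmas, tags


def ngram_counts(sequence, order):
    """Count n-grams of the given order, padding with sentence markers."""
    padded = ["<s>"] * (order - 1) + list(sequence) + ["</s>"]
    return Counter(
        tuple(padded[i:i + order]) for i in range(len(padded) - order + 1)
    )


sentence = (
    "Eesti|Eesti|N--sg------f- Vabariik|vabariik|N--sn------f- "
    "on|ole|Vp-s--3--i-a----a------l- riik|riik|N--sn------l- "
    "Põhja-Euroopas|Põhja-Euroopa|N--ss------u- .|.|T------"
)

surfaces, lemmas, tags = split_factors(sentence)
tag_ngrams = ngram_counts(tags, TAG_LM_ORDER)

print(" ".join(lemmas))  # input for lemma-based word alignment
print(" ".join(tags))    # input for the tag language model
```

Because the tag vocabulary is much smaller than the surface-form vocabulary, long tag n-grams recur far more often than word n-grams of the same length, which is what makes a higher-order tag LM feasible.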

At the moment we have evaluated the applicability of these methods for English-Estonian SMT only in small-scale experiments; large-scale experiments using all collected parallel training data will be completed by the end of the project.

Estonian part of speech tagger for MT

As a part-of-speech tagger is necessary for factored SMT, we have created an Estonian part-of-speech tagger. There are two conceptually different types of POS-taggers:

1) POS-taggers that perform POS-tag guessing

2) POS-taggers that perform POS-tag disambiguation.

POS-taggers based on the disambiguation methodology perform morphological analysis of each token of a text and then disambiguate the POS-tag by analysing the surrounding context (the context may include several tokens around the token being disambiguated as well as the morphological information of those tokens). Guessers, on the other hand, rely solely on the text tokens and, depending on the machine learning algorithm applied, possibly also on the POS-tags of previously tagged tokens. That is, while disambiguation-based POS-taggers select from a relatively small set of possible POS-tags when disambiguating a token, guessing-based POS-taggers have to select the correct tag (i.e., the most likely tag in a particular context) from all possible tags. The two methods can be combined into hybrid methods that handle out-of-vocabulary words better.

The POS-tagger developed in the project is based on the disambiguation methodology. The workflow of POS-tagging a single document is depicted in the figure below.

The workflow is as follows:

  1. At first, a document is broken down into sentences. For sentence breaking we use Tilde’s proprietary solutions developed prior to the project.
  2. Then, each sentence is tokenised. For tokenisation we use Tilde’s proprietary solutions developed prior to the project.
  3. Next, each token is morphologically analysed using the freely available Filosoft morphological analyser Vabamorf[2]. In order to use the morphological analyser, we developed a wrapper (a tool that integrates Vabamorf) that passes Tilde’s tokeniser output to Vabamorf and converts the Vabamorf output into a format compliant with the POS-tagger’s input format. An example of the output format of the morphological analyser for the sentence “Eesti Vabariik on riik Põhja-Euroopas.” is as follows:

Eesti Eesti N--sg------f- 00
Eesti Eesti N--sn------f- 00
Eesti eesti G------f- 00
Vabariik vabariik N--sn------f- 01
on ole Vp-s--3--i-a----a------l- 02
on ole Vp-p--3--i-a----a------l- 02
riik riik N--sn------l- 03
Põhja-Euroopas Põhja-Euroopa N--ss------u- 04
. . T------ 05

The example shows that there are 3 possible morphological analyses for the token “Eesti”, one analysis for “Vabariik”, 2 for “on”, and 1 for the remaining three tokens. The task of the POS-tagger’s disambiguation module is to select the most probable tag from the given tags.
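
The counts given above can be reproduced with a short Python sketch that groups the analyses by token. The in-memory representation used here, a tuple of token position, token, lemma and tag, is an assumption made for illustration and is not the wrapper's actual data structure; the function name `group_candidates` is likewise hypothetical.

```python
from collections import defaultdict

# (token position, token, lemma, tag); the tuple layout is an assumption
analyses = [
    (0, "Eesti", "Eesti", "N--sg------f-"),
    (0, "Eesti", "Eesti", "N--sn------f-"),
    (0, "Eesti", "eesti", "G------f-"),
    (1, "Vabariik", "vabariik", "N--sn------f-"),
    (2, "on", "ole", "Vp-s--3--i-a----a------l-"),
    (2, "on", "ole", "Vp-p--3--i-a----a------l-"),
    (3, "riik", "riik", "N--sn------l-"),
    (4, "Põhja-Euroopas", "Põhja-Euroopa", "N--ss------u-"),
    (5, ".", ".", "T------"),
]


def group_candidates(rows):
    """Collect all (token, lemma, tag) analyses for each token position."""
    candidates = defaultdict(list)
    for position, token, lemma, tag in rows:
        candidates[position].append((token, lemma, tag))
    return [candidates[position] for position in sorted(candidates)]


grouped = group_candidates(analyses)
ambiguity = [len(candidates) for candidates in grouped]

for candidates in grouped:
    token = candidates[0][0]
    print(token, len(candidates))
```

Only the tokens with more than one analysis ("Eesti" and "on" in this sentence) actually require a decision from the disambiguation module.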

  4. Finally, the disambiguation is performed by a disambiguation module that is based on the averaged perceptron (Rosenblatt, 1958) methodology[3] and uses a pre-trained POS-tagging model.
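
To make the disambiguation step more concrete, the following is a minimal, generic sketch of an averaged perceptron that chooses among the tags proposed by the morphological analyser. It is not the project's implementation: the class and function names, the three-feature set (bias, word form, previous tag), the use of gold history during training and the single-sentence toy training set are all assumptions made for illustration.

```python
class AveragedPerceptron:
    """Scores only the tags proposed for a token by the morphological analyser."""

    def __init__(self):
        self.weights = {}  # (feature, tag) -> current weight
        self.totals = {}   # (feature, tag) -> sum of the weight over all steps
        self.steps = 0

    def score(self, features, tag):
        return sum(self.weights.get((f, tag), 0.0) for f in features)

    def predict(self, features, candidates):
        # Highest score wins; ties are broken alphabetically for determinism.
        return min(candidates, key=lambda tag: (-self.score(features, tag), tag))

    def update(self, features, gold, guess):
        if guess != gold:
            for f in features:
                self.weights[(f, gold)] = self.weights.get((f, gold), 0.0) + 1.0
                self.weights[(f, guess)] = self.weights.get((f, guess), 0.0) - 1.0
        # Naive averaging: add the current weights to the running totals after
        # every step (simple, but slower than the usual lazy bookkeeping).
        self.steps += 1
        for key, value in self.weights.items():
            self.totals[key] = self.totals.get(key, 0.0) + value

    def average(self):
        if self.steps:
            self.weights = {k: t / self.steps for k, t in self.totals.items()}


def features(token, prev_tag):
    # Hypothetical feature set; a real tagger would use a richer context.
    return ["bias", "word=" + token.lower(), "prev_tag=" + prev_tag]


def train(model, sentences, epochs=5):
    for _ in range(epochs):
        for sentence in sentences:
            prev_tag = "<s>"
            for token, candidates, gold in sentence:
                feats = features(token, prev_tag)
                model.update(feats, gold, model.predict(feats, candidates))
                prev_tag = gold  # gold history during training (a simplification)
    model.average()


def disambiguate(model, sentence):
    tags, prev_tag = [], "<s>"
    for token, candidates in sentence:
        prev_tag = model.predict(features(token, prev_tag), candidates)
        tags.append(prev_tag)
    return tags


# Toy training data: (token, candidate tags from the analyser, gold tag).
training_data = [[
    ("Eesti", ["N--sg------f-", "N--sn------f-", "G------f-"], "N--sg------f-"),
    ("Vabariik", ["N--sn------f-"], "N--sn------f-"),
    ("on", ["Vp-s--3--i-a----a------l-", "Vp-p--3--i-a----a------l-"],
     "Vp-s--3--i-a----a------l-"),
    ("riik", ["N--sn------l-"], "N--sn------l-"),
    ("Põhja-Euroopas", ["N--ss------u-"], "N--ss------u-"),
    (".", ["T------"], "T------"),
]]

model = AveragedPerceptron()
train(model, training_data)
predicted = disambiguate(model, [(tok, cands) for tok, cands, _ in training_data[0]])
print(predicted)
```

The key property of the disambiguation approach is visible in `predict`: the model never considers the full tagset, only the candidates that the analyser returned for the token.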

The POS-tagger provides two output formats – a TreeTagger[4] compliant output format and a Moses[5] factored data compliant output format. Examples are as follows:

  1. TreeTagger format example.

Eesti N Eesti N--sg------f-
Vabariik N vabariik N--sn------f-
on V ole Vp-s--3--i-a----a------l-
riik N riik N--sn------l-
Põhja-Euroopas N Põhja-Euroopa N--ss------u-
. T . T------

  2. Moses factored data format example.

Eesti|Eesti|N--sg------f- Vabariik|vabariik|N--sn------f- on|ole|Vp-s--3--i-a----a------l- riik|riik|N--sn------l- Põhja-Euroopas|Põhja-Euroopa|N--ss------u- .|.|T------
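
As an illustration of how the two output formats relate to each other, the sketch below renders the same disambiguated analyses in both of them. The function names are hypothetical. Two details are assumptions inferred from the examples rather than documented behaviour: that the coarse POS column of the TreeTagger-style output is the first character of the full tag, and that the columns are separated by tabs.

```python
# Disambiguated analyses for the example sentence: (token, lemma, tag)
rows = [
    ("Eesti", "Eesti", "N--sg------f-"),
    ("Vabariik", "vabariik", "N--sn------f-"),
    ("on", "ole", "Vp-s--3--i-a----a------l-"),
    ("riik", "riik", "N--sn------l-"),
    ("Põhja-Euroopas", "Põhja-Euroopa", "N--ss------u-"),
    (".", ".", "T------"),
]


def to_treetagger(rows):
    """One token per line: token, coarse POS, lemma, full tag."""
    lines = []
    for token, lemma, tag in rows:
        coarse_pos = tag[0]  # assumption: first tag character is the POS
        lines.append("\t".join((token, coarse_pos, lemma, tag)))  # tab is assumed
    return "\n".join(lines)


def to_moses_factored(rows):
    """One sentence per line: factors joined by '|', tokens by spaces."""
    return " ".join("|".join(row) for row in rows)


print(to_treetagger(rows))
print(to_moses_factored(rows))
```

A production converter would additionally have to escape characters that are special in the Moses factored format, such as a literal `|` inside a token; this is omitted here for brevity.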

In order to train the Estonian POS-tagging model, we used the morphologically annotated corpus of Estonian created by the University of Tartu[6]. As the corpus was annotated using a different tagset from the one used by the morphological analyser, we transformed the tags of the annotated data into the tags that the morphological analyser provides for the tokens. For model training, we identified for each token in the annotated data the correct POS-tag and the correct lemma, and added the other possible analyses provided by the morphological analyser for that token.

During the semi-automatic training data preparation process approximately 2,500 sentences were discarded because they contained tokens for which the morphological analyser did not provide an analysis that could be uniquely matched to the annotated data tags (possible explanations are mismatches between the annotation guidelines used in the corpus creation process and the morphological analyser, as well as annotation mistakes in the morphologically annotated corpus). After transformation of the annotated data, a total of 42,287 sentences (585,965 tokens) remained in the annotated training data corpus.

When the training data was ready, we trained the POS-tagging model in several iterative steps by improving the feature set (i.e., the context length, different morphological parameter configurations, etc.) used by the averaged perceptron learning algorithm. In each iteration, the POS-tagging model was evaluated using 10-fold cross-validation. The resulting POS-tagger achieves a precision of 97.51±0.08% (99% confidence interval). This is a state-of-the-art result for POS-tagging of Estonian texts.
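
One common way of obtaining a figure of the form "mean ± half-width" from 10-fold cross-validation is sketched below. The report does not state how its interval was computed, so the use of Student's t distribution is an assumption, and the per-fold scores are invented purely for illustration; they are not the project's results.

```python
import math
import statistics

# Two-sided 99% critical value of Student's t distribution, 9 degrees of freedom
T_CRIT_99_DF9 = 3.250


def mean_with_ci(fold_scores, t_crit=T_CRIT_99_DF9):
    """Return the mean and the half-width of its confidence interval."""
    mean = statistics.mean(fold_scores)
    std_err = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, t_crit * std_err


# Hypothetical per-fold precision values (percent), NOT the reported results
fold_scores = [96.9, 97.1, 97.0, 97.2, 96.8, 97.0, 97.1, 96.9, 97.0, 97.0]

mean, half_width = mean_with_ci(fold_scores)
print(f"{mean:.2f} ± {half_width:.2f} % (99% confidence interval)")
```

The critical value is tied to the number of folds: with a different number of folds, the value for the corresponding degrees of freedom has to be used instead.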

4. Parallel corpora and trained MT systems

This section describes the general-domain MT system that Tilde built in 2014 and compares it with some other MT systems. At Tilde we have been building Estonian MT systems on a regular basis for at least 3 years, trying to improve them each time. We have built both general-domain and domain-specific MT systems, and we have documented and presented our experience (Skadiņš et al., 2014).

This year our goal was to build the best possible general-domain English-Estonian MT system by using the best data set available with our current technology.

We trained an SMT system with the LetsMT platform (Vasiļjevs et al., 2012), which is based on the Moses toolkit (Koehn et al., 2007). In this training we did not use any newer technological advances compared with the previous system; they are in the pipeline and will be examined in future system builds in 2015.

In addition to the formerly used data, we made use of some new public corpora as well as data processed within the EOPC project. We can still see a correlation between the amount of training data used and the quality of the MT system.

We use both publicly available corpora collected by other institutions and corpora collected by us. The most important sources of data are:

1) Publicly available parallel corpora: the Europarl corpus, DGT-TM, JRC-Acquis, ECDC and other corpora available from the Joint Research Centre, and the OPUS corpus, which includes data from the European Medicines Agency, the European Central Bank, the EU constitution and others.

2) Parallel corpora collected by us: national legislation, standards, technical documents and product descriptions widely available on the web (some examples: EU brochures from EU Bookshop, news portals and many more).

3) Monolingual corpora collected by us: mainly data crawled from the web (state institutions, portals, newspapers etc.).

See Table 1 for the amount of data used in the training of our SMT systems.

We used the BLEU metric for automatic evaluation on a general-domain evaluation corpus, which is a mixture of texts from different domains representing the expected translation needs of a typical user. The corpus includes fiction, business letters, IT texts, news and magazine articles, legal documents, popular science texts, manuals and EU legal texts, and it contains 512 parallel sentences in English and Estonian.
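
For readers unfamiliar with the metric, the following sketch computes corpus-level BLEU with a single reference per sentence, n-grams up to length 4, clipped counts and the brevity penalty. It is a simplified illustration rather than the evaluation script used for the results reported here: whitespace tokenisation, the absence of smoothing and the toy English sentences are all assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) for tokenised hypotheses and references."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(
        math.log(m / t) for m, t in zip(matches, totals)
    ) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Toy example (hypothetical sentences, whitespace tokenisation)
hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()

score = corpus_bleu([hypothesis], [reference])
print(f"BLEU = {score:.2f}")
```

Because the n-gram statistics are accumulated over the whole corpus before the precisions are computed, the score of a 512-sentence evaluation set is not the average of per-sentence scores.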

Table 1. Amount of training data and results of the automatic evaluation

Language direction / Parallel corpus, sentences / Monolingual corpus, sentences / BLEU score
English – Estonian, 2012 / 10.5M / 28.3M / 23.78
English – Estonian v2.1, 2013 / 12.5M / 33.1M / 24.22
English – Estonian v2.2, 2014 / 17.8M / 37.1M / 24.48

The summary of automatic evaluation results in comparison with Google Translator is presented in Figure 1.

Figure 1. Our MT systems automatically compared to Google

For human evaluation of the systems we used ranking of translated sentences relative to each other. This is the official determinant of translation quality used in the Workshop on Statistical Machine Translation shared tasks. The summary of human evaluation results in comparison with Google Translator is presented in Figure 2:

Figure 2. Our MT systems compared to Google Translator by human evaluation

Training a new English-Estonian MT system has proven worthwhile: the results are better than those of both the previous versions of Tilde MT systems and the competitor.

5. R&D plans for 2015

As future directions for quality improvement we see:

  • Continuing parallel data collection (including (i) advanced content crawling methods and (ii) targeted MT output post-editing to create new MT training data)
  • More efficient use of Estonian morphology knowledge in SMT (including (i) improved word alignments using morphology and language specific data pre-processing rules, (ii) improved language modelling)
  • Better treatment of non-translatable tokens (e-mails, web addresses, numbers, brackets, quotes, tags etc.)
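
One possible way of treating non-translatable tokens, shown here only as a sketch, is to replace them with placeholders before translation and to restore them afterwards. The patterns, the placeholder format and the function names below are hypothetical and deliberately simple; they are not a description of a planned implementation, and a real system would also have to make sure that the placeholders pass through the decoder unchanged.

```python
import re

# A single combined pattern, so that one pass handles all token types
# and the placeholders themselves are never re-matched.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://\S+|www\.\S+)"
    r"|(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<NUM>\d+(?:[.,]\d+)*)"
)
PLACEHOLDER_RE = re.compile(r"__(?:URL|EMAIL|NUM)(\d+)__")


def mask(text):
    """Replace non-translatable tokens with indexed placeholders."""
    saved = []

    def replace(match):
        saved.append(match.group(0))
        return f"__{match.lastgroup}{len(saved) - 1}__"

    return TOKEN_RE.sub(replace, text), saved


def unmask(text, saved):
    """Put the original tokens back in place of the placeholders."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], text)


source = "Write to info@example.com or visit http://example.com/page before 15.12.2014."
masked, saved = mask(source)
restored = unmask(masked, saved)

print(masked)
print(restored)
```

Since every placeholder carries its own index, the tokens can be restored correctly even if the translation changes their order.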

References

Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint Language and Translation Modeling with Recurrent Neural Networks. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 763-770, Columbus, Ohio, USA. Association for Computational Linguistics.

O. Bojar, D. Mareček, V. Novák et al., English-Czech MT in 2008, in Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, Association for Computational Linguistics, 2009

Ann Clifton and Anoop Sarkar. 2011. Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction. ACL 2011.

C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of ACL, July, 2010

M. Freitag, M. Huck, and H. Ney. Jane: Open Source Machine Translation System Combination. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, April 2014.

Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 244–251.

Philipp Koehn, Franz Josef Och, Daniel Marcu (2003). Statistical phrase based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics.

Koehn, Philipp and Hoang, Hieu (2007): Factored Translation Models, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, demonstration session.

Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montreal, Canada. Association for Computational Linguistics.

David Mareček, Rudolf Rosa, Petra Galuščáková and Ondřej Bojar: Two-step translation with grammatical post-processing. In Proceedings of WMT 2011, EMNLP 6th Workshop on Statistical Machine Translation, Edinburgh, UK, pp. 426–432, 2011

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.

Skadiņa I., Brālītis E. 2008. Experimental Statistical Machine Translation System for Latvian. // Proceedings of the 3rd Baltic Conference on HLT, Vilnius, 2008, 281-286.

Skadiņa I., K. Levāne-Petrova, G. Rābante. 2012. Linguistically Motivated Evaluation of English-Latvian Statistical Machine Translation. // Human Language Technologies – The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, IOS Press, Frontiers in Artificial Intelligence and Applications, Vol. 247, pp. 221-229.

Skadiņš, R., Goba, K., & Šics, V. (2010). Improving SMT for Baltic Languages with Factored Models. In Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 219 (pp. 125–132). Riga: IOS Press.