Building Machine Translation Systems for Indigenous Languages
Ariadna Font Llitjós, Roberto Aranovich[1], Lori Levin[2]
Carnegie Mellon University
Key Words: natural language processing, machine translation, Mapudungun, Quechua, cooperation with communities
1. Introduction
In this paper we focus on the cooperation between a team of computational linguists and two communities in Latin America that speak different indigenous languages: Mapudungun in Chile (2002-2005) and Quechua in Peru (2004-2005).
In both cases, the cooperation was carried out as part of Avenue, a project devoted to the fast and affordable development of Machine Translation (MT) systems for resource-poor languages. The effort can be summarized as follows: members of the community provide us with the data; we provide them with the training required to collect the data, we process the data, and we deliver Natural Language Processing (NLP) resources and tools, such as a transcribed spoken corpus, a dictionary, a morphological analyzer, and ultimately an MT system.
1.1. Avenue project
The Avenue Project at the Language Technologies Institute (LTI) at Carnegie Mellon University focuses on affordable machine translation for languages with scarce resources. With respect to machine translation, “scarce resources” refers to the lack of a large corpus in electronic form or the lack of native speakers trained in computational linguistics. There may be other difficulties as well, such as spelling and orthographical conventions that are not standardized and missing vocabulary items.
Avenue uses a multi-engine approach to machine translation (Frederking and Nirenburg, 1994) in order to make the best use of whatever resources are available:
1. If a parallel corpus is available in electronic form, we can use example-based machine translation (EBMT) (Brown, 1997; Brown and Frederking, 1995) or statistical machine translation (SMT).
2. If native speakers with training in computational linguistics are available, a human-engineered set of rules can be developed.
3. Finally, if neither a corpus nor a human computational linguist is available, Avenue uses a machine learning technique called Seeded Version Space Learning (Probst, 2005) to learn translation rules from data that is elicited from a native speaker.
The last approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of a few thousand sentence pairs) using a specially designed elicitation tool (Probst et al. 2001).
From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language (SL) transfer to the target language (TL). The collection of transfer rules, which constitutes the translation grammar, is then used in our run-time system to translate previously unseen SL text into TL text (Probst et al. 2003).
Our transfer-based MT system consists of four main modules: elicitation of a word-aligned parallel corpus; automatic learning of translation rules; the run-time transfer system (which might include a morphological analyzer and generator); and the interactive and automatic refinement of the translation rules.
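The resource-driven choice of engines described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the decision logic as stated in the text; the class and function names (Resources, candidate_engines) and the engine labels are hypothetical and are not part of the Avenue software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resources:
    """Hypothetical summary of what is available for a language pair."""
    has_parallel_corpus: bool
    has_trained_native_linguists: bool
    has_bilingual_speakers: bool


def candidate_engines(r: Resources) -> List[str]:
    """Return the MT approaches that the available resources would support."""
    engines: List[str] = []
    if r.has_parallel_corpus:
        # Corpus-based engines need an electronic parallel corpus.
        engines += ["EBMT", "SMT"]
    if r.has_trained_native_linguists:
        # Human-engineered transfer rules need trained native speakers.
        engines.append("hand-written transfer rules")
    if not engines and r.has_bilingual_speakers:
        # Fallback: learn rules from data elicited from bilingual speakers.
        engines.append("learned transfer rules (Seeded Version Space Learning)")
    return engines


if __name__ == "__main__":
    print(candidate_engines(Resources(False, False, True)))
```

A multi-engine system could run every engine returned by such a function and combine their outputs; how the outputs are combined is outside the scope of this sketch.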
1.2. Working with the communities
Full partners, not just informants.
… vs working with the government
1.3. CMU team
Jaime Carbonell – Director of the Avenue project at CMU
Rodolfo Vega – international relations coordinator?
Lori Levin – Main faculty advisor for the Avenue-Mapudungun project
Ralf Brown – EBMT main developer and specialist
Alon Lavie – parsing specialist
Ariadna Font Llitjós – PhD student, working on Mapudungun and Quechua, particularly on interactive and automatic refinement of translation rules.
Christian Monson – PhD student, working mostly on Mapudungun, focusing on morphological analysis and how to learn morphological clusters automatically
Erik Peterson – transfer engine main developer
Kathrin Probst – PhD graduate who worked on the automatic learning of translation rules.
(From Pitt)
Roberto Aranovich – PhD student, working on the Mapudungun-Spanish translation rules.
Pascual Masullo – linguistics professor at Pitt, who did consulting work for the Mapudungun-Spanish dictionary evaluation/development.
2. Mapudungun efforts/cooperation
Mapudungun is spoken by over 900,000 people (Mapuche) in Chile and Argentina. Thanks to an active bilingual and multicultural education program, Mapuche children are now being taught to be literate in both Mapudungun and Spanish. The Chilean Ministry of Education (Mineduc) teamed up with the Language Technologies Institute's Avenue project to collect data and produce language technologies that support bilingual education. The main resource that has come out of the Mineduc-LTI partnership is a Mapudungun-Spanish parallel corpus consisting of approximately 200,000 words of text and 120 hours of transcribed speech. Other products that have been developed or are still under development in the framework of the Avenue project are a spelling checker, an online dictionary and a machine translation prototype.
For more details about multilingualism in Chile, see Levin et al. (2002).
[Lori: should we talk about the meetings we have had with them, which are concrete examples of our cooperation? (I went down there in April and then November of 2002)]
2.1. Chilean team
[from LREC 2002 section 4]
In order to produce language technologies for bilingual education, we need people with several kinds of expertise, including computational linguists (who don't need to know the language we are processing), bilingual education experts, and native speakers with conscious and implicit linguistic knowledge about their language.
The first partnership we established was between the LTI and the Instituto de Estudios Indígenas (IEI, Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO). In a preliminary meeting in May 2000, we agreed to collaborate in building language technologies to respond to the demands of intercultural bilingual education programs for the Mapuche.
After LTI and IEI agreed on the Avenue-Mapudungun vision and goals and established a plan of action for the year 2001, the Intercultural Bilingual Education Program of the Ministry of Education (Mineduc) agreed to participate in the project and to fund 90% of the Avenue-Mapudungun expenses for the year 2001. This support has been extended for the year 2002. Mineduc also provides a policy framework that allows the Avenue-Mapudungun project to be in tune with the national plans to improve the quality and equity of the Chilean education system with respect to ethnic communities.
Ministry of Education
Carolina Huenchullan Arrúe – National Coordinator of the Bilingual Multicultural Education Program in the Ministry of Education in Chile.
Claudio Millacura Salas – Pedagogical Coordinator (encargado pedagógico) of the Bilingual Multicultural Education Program in the Ministry of Education in Chile.
IEI-UFRO
The IEI team consists of near-native speakers of at least one major dialect of Mapudungun. All of the team members but one are of Mapuche descent. They are also bilingual in Spanish, and accustomed to writing in Mapudungun. The team also includes Mapuches with training in linguistics and involvement in bilingual education:
Eliseo Cañulef – Team Coordinator, specializing in intercultural bilingual education.
Hugo Carrasco – Linguist, Dean of the Faculty of Humanities and Education at UFRO.
Rosendo Huisca – Distinguished native speaker.
Hector Painequeo – Linguist, professor at UFRO, with a master's in Linguistics (Mexico)??
Flor Caniupil – Senior transcriber/translator.
Luis Caniupil Huaiquiñir – Data collection specialist.
Marcela Collio Calfunao – Transcriber/translator.
Cristian Carrillan Anton – Transcriber/translator.
Salvador Cañulef – Computer and software support specialist.
2.1. Mapudungun Database
The first plan of action of Avenue-Mapudungun was to make a parallel corpus of Spanish and Mapudungun that could be used for corpus-based language technologies (language technologies that do not involve human rule engineering) and could also be used for corpus linguistics or corpus-based computer-assisted language learning. The corpus has two main parts: written texts and transcribed speech. Both parts of the corpus (written and spoken) were collected by a team that was assembled at the IEI.
2.1.1. Written corpus
The written Mapudungun corpus consists of historical documents and current newspaper articles. The two historical texts are Memorias de Pascual Coña, the life story of a Mapuche leader written by Ernesto Wilhelm de Moessbach, and Las Últimas Familias by Tomás Guevara. Both were first typed into electronic form as exact copies of the originals and then transliterated into the orthographical conventions chosen by Avenue-Mapudungun. The modern newspaper, Nuestros Pueblos, is published by the Corporación Nacional de Desarrollo Indígena (CONADI). The length of the text corpus is about 200,000 words.
2.1.2. Speech corpus
The corpus of spoken Mapudungun consists of 120 dialogues, each of which is one hour long. The content and recording methods for the spoken corpus are based on several decisions made by LTI and IEI: restricting the corpus to a limited semantic domain, inclusion of the major dialects of Mapudungun, recording quality that is suitable for speech recognition, and design of orthographical conventions to be used by Avenue-Mapudungun.
The subject matter of the speech corpus
Since machine translation systems for restricted domains can usually achieve higher quality than general-purpose machine translation, we chose to record a corpus in a limited domain, specifically primary and preventive health, covering both Western and Mapuche traditional medicine. A Mapudungun native speaker from the IEI team conducted conversations with informants based on a guide composed by the IEI team to capture keywords and narrative styles used in the target domain. The informants were asked to describe illnesses and remedies that they or their relatives had experienced, and to provide a complete account of symptoms, diagnoses, treatments, and results. Figure 1 contains an excerpt from the 70 questions that were used to prompt the discussion. In accordance with Mapuche culture, the interviews were scheduled ahead of time and took place in the informant's house or, in rare cases, in the informant's place of work.
[Roberto: can you check that the Mapudungun is correct?]
I. Mantención de la salud y enfermedades
1. Chumkeymi tami külfünküleal. (¿Cómo hace para mantenerse así de bien?)
2. Rüfkünungey am tami amulngen kiñe machimew.
(¿Es verdad que el médico lo mandó donde una machi?)
. . .
II. Embarazo - Niepeklen
1. Tunten püñeñ dew nieymi. (¿Cuántos hijos ha tenido?)
2. Tunten mongeley. (¿Cuántos están vivos?)
3. Chumngekefui tami niepüñekülen, kutrankawkefuimi kam femkelafuimi.
(¿Cómo eran sus embarazos? ¿Tuvo algún problema?)
...
III. Las enfermedades - Puke kutran
1. Chumngey tami kutran. (¿En qué consiste su enfermedad?)
2. Chem. üy niey tami kutran. (¿Cómo se llama su enfermedad?)
3. Chem. Dewmangekey pelontual chem. Kutran niel.
(¿Qué tipo de exámenes se necesitan para efectuar el diagnóstico?)
...
Figure 1: Examples of conversation topics in the Spanish-Mapudungun parallel corpus.
The informants for the speech corpus
The informants are between 21 and 75 years old, most of them between 45 and 60. All informants are fully native speakers. Most informants work as auxiliary nurses in rural areas of the Chilean Public Health System, or are knowledgeable in traditional Mapuche medicine. Among the informants are some machi, the Mapuches' specialized medicine wise-women, who were asked to answer the interviewer's questions without providing specialized knowledge that is only known by and transmitted to initiated people.
The dialects included in the speech corpus
There are four major Mapudungun variants: Lafkenche, Nguluche, Pewenche and Williche. For the oral corpus the IEI team chose to work with three dialects (Lafkenche, Nguluche, Pewenche) that are quite similar, with some minor semantic and phonetic differences. The Williche variant presents some morpho-syntactic differences, specifically in the pronouns and verb conjugations. The IEI team will return to Williche at a later stage in the project.
Recording and Transcription methods
After several attempts by other native Mapudungun speakers, a member of the Chilean team, Luis Caniupil Huaiquiñir, succeeded in the hard task of getting people to talk in front of a microphone about medical issues. Luis worked at a hospital and thus knew several nurses and medical professionals, which proved to be of great help for the project.
[Lori: you probably know how to say that so that it sounds more formal and accurate]
The dialogues were recorded using a Sony DAT recorder (48 kHz) and a Sony digital stereo microphone. The tapes were downloaded using CoolEdit 2000 v.1.1. For transcription, we used the TransEdit transcription tool v.1.1 beta 10, developed by Susanne Burger and Uwe Meier[3]. The software synchronizes the transcribed text and the wave file. It also shows the actual waveform, making it easy to identify each speaker turn as well as simultaneous speakers. The transcribers used the LTI's transcription conventions for noises and disfluencies, including aborted words, mispronunciations, poor intelligibility, repeated and corrected words, false starts, hesitations, undefined sounds or pronunciations, non-verbal articulations, and pauses. Foreign words, in this case Spanish words, are also labeled.
Establishing the Orthography
Language technologies for languages with scarce resources often suffer from the lack of a standardized character set and spelling conventions. Because of the availability of experts on the IEI team, Avenue-Mapudungun decided to create an orthographically uniform corpus. However, because there are competing orthographies for Mapudungun, we agreed to develop orthographical conventions that would be for use only by Avenue-Mapudungun. This took much longer than we had anticipated (do you think we should say this?). At a later time, we will evaluate the social and cultural acceptability of the Avenue-Mapudungun orthography.
The IEI developed a supra-dialectal alphabet that comprises 28 letters that cover 32 phones used in the three Mapudungun variants. The main criterion for choosing alphabetic characters was compatibility with the current Spanish keyboard found on all computers in Chilean offices and schools. The alphabet uses the same letters used in Spanish for those phonemes that sound like Spanish phonemes. Diacritics such as apostrophes are used for sounds that are not found in Spanish.
[Add Roberto's evaluation and proposal]
Translations
Do we want to say something about this? Spanish fluency problem… which will badly reflect in the translations output by the EBMT system.
[We could add the type-token curves for Spanish and Mapudungun in lrec04 paper, if we have enough space]
2.2. Developing Natural Language Processing Tools
As mentioned above, Mapudungun is a test case for Avenue in which we are experimenting with three approaches to machine translation. First, we will focus on the most experimental of our machine translation methods: automatic learning of transfer rules from carefully elicited sentences. There are five main components of the Avenue rule-learning system: the elicitation system, morphology learning, Seeded Version Space Learning of transfer rules, the run-time transfer rule system, and the Rule Refinement module, which includes the Translation Correction Tool.
2.2.1. Elicitation Corpus
The purpose of the elicitation system is to collect a parallel corpus whose content is controlled in order to ensure that it illustrates the basics of the language being elicited. The elicitation system (Probst et al., 2001; Probst and Levin, 2002) can be used by an informant who is bilingual in the language of elicitation and the language being elicited. In the case of Avenue-Mapudungun, the language of elicitation is Spanish and the language being elicited is Mapudungun. The informants are required only to translate Spanish sentences into Mapudungun and to align Spanish words to Mapudungun words as well as they can. Because a human linguist may not be available to supervise the elicitation, a user interface is available for presenting sentences to an informant and allowing the informant to translate and align sentences. Some potential pitfalls of automated elicitation are described in Probst and Levin (2002).
He has sold both of his cars.
El ha vendido sus dos automóviles
fey weluiñi epu awtu
He can move both of his thumbs.
El puede mover sus dos pulgares
fey pepi newüleliñi epu fütrarumechangüll
He loves both of his sisters.
El ama a sus dos hermanas
fey poyey ñi epu deya
He loves both of his brothers.
El ama a sus dos hermanos
fey poyey ñi epu peñi
Figure 2: Example from the Elicitation Corpus.
A fragment of the elicitation corpus is shown in Figure 2. In each example, the elicitation sentence is shown in English and Spanish. In actual use, however, Mapudungun informants would see only the Spanish elicitation sentence. The third line of each example shows the Mapudungun translation provided by the Mapudungun informant.
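As a rough illustration of what one word-aligned entry of such a corpus might look like as data, the following Python sketch stores a Spanish elicitation sentence, its Mapudungun translation, and a list of word-alignment links. The class name ElicitedPair, its fields, and in particular the alignment links in the example are assumptions made for illustration only; they are not the elicitation tool's actual format, and the links were not taken from the corpus.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ElicitedPair:
    """Hypothetical record for one elicited, word-aligned sentence pair."""
    source: str                       # elicitation sentence (Spanish)
    target: str                       # informant's translation (Mapudungun)
    alignment: List[Tuple[int, int]]  # (source index, target index), 0-based

    def aligned_words(self) -> List[Tuple[str, str]]:
        """Return the word pairs linked by the alignment."""
        src, tgt = self.source.split(), self.target.split()
        return [(src[i], tgt[j]) for i, j in self.alignment]

    def unaligned_source(self) -> List[str]:
        """Return source words that the informant left unaligned."""
        covered = {i for i, _ in self.alignment}
        return [w for k, w in enumerate(self.source.split()) if k not in covered]


# Sentence pair from Figure 2; the alignment links are an illustrative guess.
pair = ElicitedPair(
    source="El ama a sus dos hermanas",
    target="fey poyey ñi epu deya",
    alignment=[(0, 0), (1, 1), (3, 2), (4, 3), (5, 4)],
)

if __name__ == "__main__":
    print(pair.aligned_words())
    print(pair.unaligned_source())
```

Keeping the links as index pairs rather than word pairs lets the same word appear twice in a sentence without ambiguity, and makes unaligned words (here the Spanish preposition "a") easy to find.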
The elicitation corpus follows two organizational principles. The first is compositionality. Small phrases are elicited first, and are then combined into larger phrases. For example, simple noun phrases are elicited first, followed by noun phrases containing possessors, simple sentences, and multi-clausal sentences. Compositionality in the corpus facilitates the learning of compositional transfer rules.
The second organizational principle of the elicitation corpus is the creation of minimal pairs of sentences. Minimal pairs of sentences differ in only one feature, such as tense, number of the subject, gender of the possessor, etc. A process of feature detection compares the members of the minimal pairs in order to make a first guess at what grammatical features (verb agreement with subjects and objects, number, tense, etc.) are marked in the language being elicited. Figure 2 shows a fragment from the elicitation corpus illustrating the notion of inclusion (both) and alienable and inalienable (kinship and body parts) possession.
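The comparison of the two members of a minimal pair can be approximated, very crudely, by a token-level diff of the two translations. The sketch below uses Python's standard difflib module and the sister/brother pair from Figure 2. It is a surface-level illustration only, not the feature-detection procedure used in Avenue, and the function name token_differences is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def token_differences(a: str, b: str) -> List[Tuple[List[str], List[str]]]:
    """Return the token spans where two whitespace-tokenized sentences differ."""
    ta, tb = a.split(), b.split()
    matcher = SequenceMatcher(a=ta, b=tb, autojunk=False)
    return [
        (ta[i1:i2], tb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Minimal pair from Figure 2: the sentences differ only in the possessed noun.
    print(token_differences("El ama a sus dos hermanas",
                            "El ama a sus dos hermanos"))
    print(token_differences("fey poyey ñi epu deya",
                            "fey poyey ñi epu peñi"))
```

If the elicitation sentences differ in one feature but the translations show no difference at all, that is a first hint that the feature may not be overtly marked in the elicited language; a whole-token diff like this one cannot, however, locate a difference inside a word, which matters for a morphologically rich language.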
At the beginning of the Mapudungun project, the elicitation corpus had around 850 sentences, and its coverage included basic transitive and intransitive sentences, animate and inanimate subjects and objects, definite and indefinite subjects and objects, present/ongoing and past/completed events, singular, plural, and dual nouns, simple noun phrases with determiners and adjectives, and possessive noun phrases. Following guides for field workers such as those by Comrie and Smith and by Bouquiaux and Thomas (1992), we expect the elicitation corpus to grow to several thousand sentences.
[I updated this to reflect the current status of the project, please double check]
The elicitation corpus is used for training the automatic acquisition of MT transfer rules. However, we do not expect the coverage of this system to be very broad given the time frame of the project. While the rule-based MT system is currently under development, we have already implemented a first EBMT prototype, based on the parallel corpus of transcribed and translated medical-domain dialogues.