Roman Transliteration of Indic Scripts

Kavi Narayana Murthy, Srinivasu Badugu

Department of Computer and Info. Sciences, University of Hyderabad

Abstract

In this paper we analyze the need for Roman Transliteration for Indic scripts. We evaluate the pros and cons of various schemes in use today and argue for a scientifically designed standard scheme. We offer one such scheme for the consideration of all the experts. We believe the ideas we have presented here will also be of interest to people in many other countries where the language situation is similar.

1. Translation and Transliteration:

Language is all about systematization of mappings from sound patterns to meanings, through which we can think and also communicate our thoughts and feelings to others. A script is a systematization of rendering language in the written form. The relation between language and script is not one-to-one: a given language can be written in any script and a given script can be used for writing any language, although we may have usual preferences.

In translation, texts from one language are mapped to texts in a different language. This is intended to be a meaning-preserving transformation. The words used, the structure of words and sentences, etc. could change. The script used for rendering these texts is not important. The source language text as well as the target language text can be in any suitable script. They could even be in the same script. For example

raam ne siitaa ko kittab dee diya

is a Hindi sentence rendered in Roman script, while its translation in Kannada could be

raamanu siitege pustakavannu koTTubiTTanu

Transliteration is not translation. Here the language does not change; only the script used to render the text is changed. Why should we even think of rendering one language in some script other than the default script? People may know a language but not the script - rendering the language in a script which they know can help them to read and understand the texts. For example, if you know the Hindi or Kannada language but you do not know the scripts, you can read and understand the Hindi and Kannada sentences above if you know the Roman script and the conventions used therein. Even if you do not know the language, you can still read these sentences since you know the Roman script. The main goal of transliteration is to enable the reader to read, that is, pronounce the words as accurately as practically possible. Pronunciation is important; orthography is not the basis at all. After all, the most important thing about a word or sentence is its meaning and the next most important thing is its pronunciation. Transliteration has several other benefits as we shall see soon.

2. Romanization:

It is possible to think of transliteration from any script to any other. Here we shall focus mainly on Indic script to Roman transliteration. Here we write texts in Indian languages using the Roman alphabet (which is the same as what we use for writing English. This whole article is in Roman script.) This process is known as Romanization. This can be bidirectional and if so, we can actually map from any script to any other script via Roman. The ideas presented in this paper are generic and applicable to other language and script scenarios anywhere in the world.

Romanization has several advantages.

1) Rendering a language in a script other than Roman on a computer requires fonts and a suitable rendering engine. If we need to type in these texts, we also need a keyboard driver. These may not always be available. Roman script is universally available on all computers. Even when all the required tools are available, typing in Indic scripts is still complicated and error prone. Typing in Roman could be much faster and less error prone.

2) Readers may know the language but not the script. Romanization helps.

3) When several different languages and scripts need to be used together, there will be frequent need for shifting from one system to the other. This can be cumbersome and error prone. Uniform use of Roman makes it simpler.

4) Typing in Roman can be simpler and faster than typing in their own script for many people.

5) Processing of texts rendered in different scripts may require different techniques for dealing with idiosyncrasies, if any. Rendering all texts in a uniform notation such as Roman can mitigate this problem. Computers have originated and evolved under the Roman script. Roman is the most natural and direct. It can be simpler and more efficient too, especially from the programming point of view.

6) Languages which do not have a script of their own can also be rendered in Roman.

7) When the language and script used is English but we wish to include bits of local languages, such as in English newspapers, notices, brochures, sign boards etc., Roman is natural.

8) All existing software would simply run on Roman rendered data. If data is encoded in any other non-standard form (such as in a TTF font), standard software may or may not work.

9) Rendering in Roman makes a text readable by people speaking various languages, whereas rendering the same in a local language script restricts the readership. Commercial advertisements such as product information and cinema posters would gain by rendering in Roman.

This whole article is written in the Roman script. The whole paper is basically written in the English language and the usual spelling rules of English are sufficient to read these parts. However, we include examples from Indian languages and in order to be able to read them correctly and to understand and appreciate the issues raised and solutions proposed, you will need to know the conventions we are using in this paper. We have described the Romanization conventions we have followed in this paper in the appendix. Readers may please go through these conventions before reading on.

3. Need for Standards:

There are no well established or agreed standards for rendering Indic scripts in Roman. People use a wide variety of arbitrary, illogical, unscientific, highly confusing mappings, mostly driven by intuitions based on English spelling rules, making it difficult for readers to read. We may tend to think that since we are using a script that is used for English, we can as well use the English way of rendering Indic script. This would be the worst choice.

The English writing system is alphabetic; we need to learn, store and use spelling rules to map spoken words to written form. These spelling rules are quite arbitrary and unscientific. English uses odd ways of writing long and short vowels (cut-caught, fit-feat, pull-pool, let-late, colour-collar, floatation-float), uses different spellings for the same sounds (meat-meet, their-there, too-two, are-or, sun-son etc.), same letters for different sounds (put-but, fat-fate, fit-fight, poll-pool, peg-page, pill-philosophy, case-chase, etc.), strange spellings (daughter, school, women, etc.), silent letters (psychology, coup, etc.). Therefore, using English spellings as a basis for Romanization would only add to confusions. More importantly, it is not right to assume that all users would know English. We can actually develop a simpler, more uniform, more scientific, easy to learn and use system of transliteration.

English does not have a simple and uniform way of depicting long and short vowels. English uses 'oo' and 'ee' to indicate long vowels but 'aa', 'ii', 'uu' etc. are not found in English. Further, 'oo' is used for the 'uu' sound, not as a long counterpart of the 'o' sound. Similarly, 'ee' is used for the 'ii' sound, not for the long counterpart of the 'e' sound. Therefore, mere intuition is not sufficient to render long and short vowels systematically using the conventions of the English language. Further, there is no natural way in English to show the difference between 'd' and 'D' sounds, as also 't' and 'T' sounds. We need to have a systematic way of handling 't', 'th', 'd', 'dh', 'T', 'Th', 'D' and 'Dh' sounds or there will be too many confusions. English uses 'ch' for the 'c' sound and so intuition tells us that we should use 'chh' for the 'ch' sound but then we are not using the second letter 'h' consistently. The anusvaara ('M') is pronounced differently in different contexts and people use 'm' or 'n' arbitrarily. Since we have more than 26 basic varNa-s or phonemes in our chart, either we must use multi-letter mappings or we must use upper case and lower case letters differentially or we must use special characters beyond 'a-z' or two or more of the above. Special characters may occur in their own right in texts, leading to ambiguity. Also, people often fail to realize that the whole purpose of Romanization could be to help readers of other languages. They do not know or they do not care about other languages and use mappings which are very confusing to the readers. Since a given word may involve several of these situations, one word can actually be interpreted in a large number of ways and the readers will be left to guess. Thus, English spellings and our intuitions about them cannot be the right basis for Romanization.

Romanization has become a basic necessity in India. It is therefore imperative that we develop and use standards. This would make everybody's life so much better and we can avoid a whole lot of avoidable confusions and errors. The situation is similar in many other countries too.

It is not that there are no standards at all - several proposals exist and are being followed by various groups in pockets. There are many such proposals and suggestions around and it is time we take a careful look at all of them and come out with national/international standards that can be used by all. Even the national/international standards sometimes have problems and issues. For example, Unicode is based on the ISCII standard but it does not seem to understand ISCII fully. In ISCII, two consonants in sequence will not automatically join to form a consonant cluster. We have a special symbol called 'halaMt' to form consonant clusters. 'halaMt' removes the implicit vowel 'a' in a consonant. In case we need to show two consonants in sequence without any vowel in between, and we wish to depict the first consonant as a pure consonant without any implied vowel in it, the ISCII standard requires the use of two 'halaMt-s'. Perhaps with the idea of providing a generic solution to all the languages of the world with similar properties, Unicode has introduced the concepts of zero-width joiner and zero-width non-joiner. Since 'halaMt' continues to exist, now we have more than one way of effecting the same thing, leading to ambiguities.
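The Unicode side of this point can be made concrete with a small sketch. It builds the same two consonants (Devanagari 'k' and 'Sh', chosen here only as an example) joined in three ways: with the halaMt alone, with halaMt plus zero-width non-joiner, and with halaMt plus zero-width joiner. The variable and function names are our own for illustration; how each sequence is finally drawn depends on the font and rendering engine, which the code does not model.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, i.e. the halaMt
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Three code point sequences for "k followed by Sh with no vowel in between".
variants = {
    "halant only": KA + VIRAMA + SSA,
    "halant + ZWNJ": KA + VIRAMA + ZWNJ + SSA,
    "halant + ZWJ": KA + VIRAMA + ZWJ + SSA,
}

def code_points(text):
    """List each character of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

def distinct_after_nfc(strings):
    """Count how many different strings remain after NFC normalization."""
    return len({unicodedata.normalize("NFC", s) for s in strings})

if __name__ == "__main__":
    for label, text in variants.items():
        print(label, code_points(text))
    print(distinct_after_nfc(variants.values()))
```

The three sequences stay distinct after normalization, so software that compares, searches or sorts such text sees three different strings where a reader may see the same cluster.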

4. General Considerations

Indian scripts are highly phonetic in nature. That is, the writing depicts the pronounced phonemes quite closely and fairly accurately. Correct pronunciation is the right basis for transliteration of Indian scripts. Units of writing correspond to phonemes, not to narrow phonetic realization variations. Hence coding should be based on phonemic considerations, not the phonetic in a narrow sense. Allophones should not be coded.

The Roman alphabet has only 26 letters whereas many Indic scripts have more than 50 basic units and allographs as well. Therefore, we need to 1) use upper case and lower case letters to represent different units, 2) use two or more Roman letters per Indic script unit, and/or 3) use other special symbols. Mixing of upper and lower case letters would not work in case-insensitive systems. If we try to avoid case mixing, we will find it hard to develop a simple, natural, readable scheme unless we use special characters.

Combining two or more letters is based on the assumption that those combinations never occur otherwise. This may be true of the language but a script can be used to write not only this language but other languages as well. For example, we have used 'kh', 'ph' etc. for aspirated sounds knowing that these could also mean a cluster of 'k' or 'p' and 'h' consonants. These latter possibilities are exceptionally rare in our native languages. However, in the rare case that we actually wish to depict the consonant cluster formed out of, say, 'k' and 'h', we can use an escape mechanism. The backslash symbol is widely used in computer science to form escape sequences and we could resort to the same or similar devices.
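One possible reading of such an escape mechanism is sketched below. The digraph inventory is a deliberately tiny, hypothetical one and not the scheme given in the appendix, and placing the backslash between the two letters is our assumption; the paper only suggests that a backslash-like device could be used.

```python
# Hypothetical, deliberately small set of two-letter codes for aspirated
# consonants. A real scheme would list all its multi-letter codes here.
DIGRAPHS = {"kh", "gh", "ph", "bh", "th", "dh"}

def tokenize(text):
    """Split Romanized text into units, preferring two-letter codes.

    A backslash placed between two letters blocks digraph formation, so
    'k\\h' gives the cluster units 'k' and 'h' instead of aspirated 'kh'.
    """
    units = []
    i = 0
    while i < len(text):
        if text[i] == "\\":
            # The escape character is only a separator; it yields no unit.
            i += 1
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units

if __name__ == "__main__":
    print(tokenize("khaga"))     # aspirated 'kh' read as one unit
    print(tokenize("k\\haga"))   # escaped: 'k' and 'h' read separately
```

Because the escape is needed only in the rare cluster case, ordinary text stays free of extra symbols.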

Capitalizing proper names and the first letter of the first word of a sentence are conventions of the English language. These do not hold in Indian languages and we should be careful not to capitalize for these purposes.

If we need to use these notations on computers, it is better to use only those symbols which are readily available on the standard keyboard. Other special characters would be difficult to type. Also, simple, linear arrangements are easier and better compared to diacritic marks which may appear above or below a letter.

If the rendered texts need to be stored and processed by the computer, it is better to use one-to-one mappings where each phoneme is mapped to one single letter. However, this may force us to use a mixture of cases and/or special symbols.

It is better to avoid the use of symbols which can appear naturally in texts. For example, symbols such as the colon, quote mark, hyphen, star and period should be avoided as they can occur in texts with different meanings.

Encoding should preferably be unique. This makes it possible to revert back to the original representation without loss or distortion. It is obvious that if the text includes Indian languages as well as English, we can Romanize but we cannot go back to the original. If we try, the original English texts will also be rendered in the chosen Indic script, unless we have a way of marking up English and Indic sections explicitly. If the original text does not include Roman letters, it should be possible to do round-trip conversion.
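A first, mechanical check of uniqueness is to look for Roman codes that are assigned to more than one source unit. The sketch below does only that; the two toy tables and the unit labels are invented for illustration and are not taken from any existing scheme.

```python
from collections import defaultdict

def find_collisions(table):
    """Return the Roman codes that are assigned to more than one unit.

    table maps a label for a script unit to its Roman code. An empty
    result means no two units share a code.
    """
    units_by_code = defaultdict(list)
    for unit, code in table.items():
        units_by_code[code].append(unit)
    return {code: units for code, units in units_by_code.items() if len(units) > 1}

# Toy table that ignores vowel length: it cannot be reverted.
lossy = {"short e": "e", "long e": "e", "short o": "o", "long o": "o"}

# Toy table that doubles the letter for the long vowel.
unique = {"short e": "e", "long e": "ee", "short o": "o", "long o": "oo"}

if __name__ == "__main__":
    print(find_collisions(lossy))
    print(find_collisions(unique))
```

Distinct codes are necessary for round-trip conversion but not sufficient: a sequence of codes must also parse in only one way, which is where multi-letter codes and escape devices come in.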

Most Indian scripts, with the exception of the few Perso-Arabic scripts, have a common origin in the ancient braahmi script. They all follow the same fundamental principles and conventions, and they are largely common too. All are phonemic representations. Variations in the phoneme sets are minimal, although the visual appearance may vary widely. It would therefore be better to have a common or a largely common scheme for all Indic scripts rather than completely different schemes for each language. If a small super-set is built by taking all the Indic scripts into consideration, we would then be able to map from any script to any script via Roman.

Tamil is one major exception - Tamil script has a much smaller number of units. Naturally, what a Tamilian would find natural would sound very strange for non-Tamilians and vice versa. For example, Tamilians tend to write dha for da, k for g, g for h, etc., causing major confusions. We should remember that the whole purpose of Romanization is to make it easy for speakers of other languages to read our language. We should therefore take their views and requirements as more important rather than imposing our views and expectations on others.

Some languages including Sanskrit and Hindi have only long versions of the 'ee' and 'oo' vowel sounds; there are no short counterparts. These are almost always pronounced as long vowels, they are written and pronounced as long vowels in other languages/scripts and must therefore be rendered as 'ee' and 'oo' respectively. Therefore, the correct rendering would be 'veeda' and 'yooga', not 'veda' or 'yoga'.

Some languages, notably Hindi, have the convention of dropping the last vowel of a word in pronunciation, but we must remember that we are not doing a phonetic transcription. Here, we only need to ensure that what is written will be read accurately by speakers of other languages. If you want readers to read it as 'yoog', render it as such, not as 'yooga'. If you write 'yooga', the readers will read the final vowel after the 'g' sound. But if you are interested in rendering the text in as natural a way as possible in the target language/script, then you must render this word as 'yooga'. In any case, there must be a long vowel 'oo' in the standard.

Although mixing upper case and lower case letters is generally considered essential, this idea should not be carried too far. Too many upper case letters mixed up with lower case letters make the text look ugly, and typing becomes more difficult and error prone due to the need for frequent use of the shift keys.

While respecting the above considerations, shorter codes would be preferred as this would save storage space.

Most importantly, the rendered text should be natural, easy to read and type in. It should be easy to learn and easy to remember. It should not lead to an increase in typing errors. After all, computers are at our service, not the other way round. Making things difficult and unnatural to human users in the name of making things easier for the computer is no good. Some proposed schemes are especially notorious in this respect. For example, 'kRShNa' is written as 'kqRNa', 'kRti' is written as 'kqwi', 'jnYaana' is written as 'jFAna', 'ceppaaru' is written as 'ceVppAru', 'rudruDu' is written as 'ruxrudu', 'peeru' is written as 'peru', 'RShi' is written as 'qRi', 'kaLa' is written as 'kalYa' in the so-called w-x scheme, making it highly unreadable.

5. Alternatives and Issues

We have long and short vowels in Indian languages and we need a simple and uniform way of depicting this difference. One choice is to use double letters. We do use double vowels to show long vowels in English too (cool, feet, etc.) but the point is that in English the short form of a long vowel is not the same vowel used once (*cul, *fet). We should use a simple rule uniformly everywhere.

Some people use capital letters for long vowels. Capitalization is totally absent in Indic scripts and it carries no connotation at all. Even in English, capital letters are only indicative of proper names and beginnings of sentences; they have nothing to do with pronunciation. The letters 'a' and 'A' are pronounced the same way in English. Therefore, using capital letters for long vowels would be the worst choice.
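Text already written with capital letters for long vowels can be brought to the doubled-letter convention mechanically. The following is a minimal sketch, assuming an input convention in which 'A', 'I', 'U', 'E' and 'O' stand for the long vowels and all other letters, including capital consonants such as 'T' or 'L', are to be left unchanged. The function and table names are ours.

```python
# Assumed input convention: a capital vowel letter marks the long vowel.
LONG_VOWELS = {"A": "aa", "I": "ii", "U": "uu", "E": "ee", "O": "oo"}

def capitals_to_doubled(text):
    """Rewrite capital-letter long vowels as doubled lower case vowels.

    Every other character, including capital consonants, is copied as is.
    """
    return "".join(LONG_VOWELS.get(ch, ch) for ch in text)

if __name__ == "__main__":
    print(capitals_to_doubled("sItA"))
    print(capitals_to_doubled("kaLa"))
```

The conversion is a single table lookup per character, which shows that the doubled-letter rule costs nothing in processing while keeping upper case letters free for other uses.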