4.  PREPROCESSING

4.1  INTRODUCTION

Chapter 2 reviewed the literature on information retrieval and cross-language information retrieval (CLIR), and Chapter 3 proposed a framework based on that review. This chapter explains the pre-processing of the user's query. The pre-processing stage accepts a query in Telugu and processes it using grammar rules and an ontology to arrive at an intermediate English construct, which is then passed to the search engine in the post-processing stage. The grammar rule structure and the ontological model are also explained. Finally, case studies show how an input Telugu query is converted into the output English intermediate constructs.

4.2  METHODOLOGY OF PROPOSED PRE-PROCESSING

The major objective of the pre-processing stage of this research work is to convert the user's Telugu query into the relevant English constructs. Three distinct components contribute to the success of the pre-processing: the tokenizer, the language grammar rules and the bilingual ontology. Figure 4.1 shows the overall process of query pre-processing.

Figure 4.1 Overall process of query pre-processing

4.2.1 Tokenizer

The user submits the query to the system, and the tokenizer divides the text into a sequence of tokens. All contiguous strings of alphabetic characters form one token. Figure 4.2 shows the tokenizer component in pre-processing.

Figure 4.2 Tokenization component

Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters. Figure 4.3 illustrates the tokenizer process for a sample Telugu query given by a user.

Figure 4.3 Tokenizer process

The steps in the tokenization of a user query are given below:

·  Segmenting Text into Words: Boundary identification is a fairly simple task, since most Telugu words are delimited by explicit boundaries such as white space. A simple program can treat white space as word boundaries and cut off leading and trailing quotation marks, parentheses and punctuation. Figure 4.4 shows a sample text segmentation process.

Figure 4.4 Simple Telugu sentence tokenization

·  Handling Abbreviations: In Telugu, a period is attached directly to the preceding word. However, when a period follows an abbreviation, it is an integral part of the abbreviation and should be tokenized together with it. Figure 4.5 shows a sample sentence with abbreviations.

Figure 4.5 Tokenizer example

·  Handling Numerical and Special Expressions: Numerical and special expressions are difficult to handle in Telugu. They can confuse a tokenizer because they usually involve rather complex alphanumeric and punctuation syntax. To handle them, the tokenizer relies on the blank spaces between words. Figure 4.6 shows a sample tokenization of a query containing special expressions.

Figure 4.6 Tokenizer example for special expressions
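The three tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration and not the tokenizer used in this work: the function name `tokenize`, the two punctuation sets and the abbreviation list passed in by the caller are assumptions made for the example, and the Latin abbreviation "Dr." only stands in for a real Telugu abbreviation.

```python
# Hypothetical punctuation sets; a real tokenizer would need a fuller inventory.
LEADING = "\"'“”‘’([{"
TRAILING = "\"'“”‘’)]},;:!?"


def tokenize(query, abbreviations=frozenset()):
    """Split a query on white space and trim the surrounding punctuation.

    A trailing period is kept only when the whole word is a known abbreviation.
    """
    tokens = []
    for chunk in query.split():  # white space marks the word boundaries
        word = chunk.lstrip(LEADING).rstrip(TRAILING)
        if word.endswith(".") and word not in abbreviations:
            # Sentence-final period: drop it, then trim what it was hiding.
            word = word.rstrip(".").rstrip(TRAILING)
        if word:
            tokens.append(word)
    return tokens


print(tokenize("దినేష్ పనికి వెళ్లతాడు."))
print(tokenize("(Dr. Rao) 12.5% \"paniki\".", abbreviations={"Dr."}))
```

Because the sketch splits only on white space, a numerical expression such as "12.5%" stays in one token, which mirrors the use of blank spaces described for special expressions.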

4.2.2 Language Grammar Rules

The tokens are sent to the language grammar rule component for processing. The detailed flow of the grammar structure is given in Appendix 1; this subsection briefly explains its essence.

The essence of Telugu grammar is as follows.

·  It follows the Subject, Object and Verb (SOV) pattern.

·  There are three persons (first, second and third), a two-way distinction in number (singular (sg.) and plural (pl.)), and a three-way distinction in gender (masculine, feminine and neuter).

·  The feminine singular is grouped with the neuter, and the feminine plural is grouped with the human.

·  Apart from the three tenses (past, present and future), Telugu has one more special tense, the future habitual.

Figure 4.7 shows the language grammar rules component.

Figure 4.7 Language Grammar rules component

The grammar rules are used to pre-process the text. The idea is to identify the appropriate word sense in the text, which helps to avoid out-of-vocabulary issues. If the user query is complex, the reordered sentence is sent to the morphological analyzer to identify the tense of the verb and the inflections added to it. Telugu verbs inflect for tense, person, gender and number, while nouns inflect for plural, oblique, case and postpositions. Figure 4.8 illustrates the working of the grammar rules component for a sample Telugu query given by a user.

Figure 4.8 Grammar rules component process

The structure of verbal complexity is unique, and capturing this complexity in a machine-analyzable and machine-generatable format is a challenging task. Inflections of the Telugu verbs include finite, infinite, adjectival, adverbial and conditional markers. The verbs are classified into a certain number of paradigms based on their inflections.

For computational purposes, 37 verb paradigms, each with 160 inflections, are identified for Telugu, along with 67 noun paradigms, each with 117 sets of inflected forms. The root words are classified into groups based on the nature of their inflections. A sample sentence and its word order are shown in Table 4.1.

Table 4.1 Sample Telugu sentence order

Sentence / దినేష్ పనికి వెళ్లతాడు.
Words / దినేష్ / పనికి / వెళ్లతాడు.
Transliteration / Dinesh / Paniki / veḷtāḍu
Gloss / Dinesh / to work / goes.
Parts / Subject / Object / Verb
Converted / Dinesh goes to work.
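The conversion in Table 4.1 can be sketched as a reordering from Subject-Object-Verb to Subject-Verb-Object. The sketch assumes that each word has already been glossed into English and tagged with its role; the role labels `S`, `O` and `V` and the function name `reorder_sov_to_svo` are illustrative choices, not part of the proposed system.

```python
def reorder_sov_to_svo(tagged_glosses):
    """Reorder (gloss, role) pairs from Subject-Object-Verb to Subject-Verb-Object."""
    order = {"S": 0, "V": 1, "O": 2}
    # sorted() is stable, so words that share a role keep their original order.
    ordered = sorted(tagged_glosses, key=lambda pair: order[pair[1]])
    return " ".join(gloss for gloss, _ in ordered)


# Glosses and roles taken from Table 4.1.
sentence = [("Dinesh", "S"), ("to work", "O"), ("goes", "V")]
print(reorder_sov_to_svo(sentence) + ".")  # Dinesh goes to work.
```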

Telugu pronouns include personal and demonstrative pronouns (the persons speaking, the persons spoken to, or the persons or things spoken about), reflexive pronouns (in which the object of a verb is acted on by the verb's subject), interrogative pronouns, indefinite pronouns, demonstrative and interrogative adjective pronouns, possessive adjective pronouns, pronouns referring to numbers, and distributive pronouns.

Telugu uses postpositions to mark words in different cases. With the use of postpositions, eight cases (vibhakti) are possible, as shown in Table 4.2.

A Telugu noun carries markers of gender, number, person and case. These are identified through three noun distinctions: human male/female, singular/plural, and non-human. A noun that denotes a human male ends with the inflection “-du”, and a noun that denotes a human female ends with “-di”.

Number marking on nouns distinguishes singular and plural. For a large number of nouns the plural inflection is “–lu”, while for some nouns of the human male category the plural suffix alternant is “–ru”. Gender, number and person marking on nouns is explicit only in the first and second person, in both singular and plural. Telugu uses a wide variety of case markers and postposition suffixes, which express grammatical case relations such as nominative, accusative, dative, instrumental, genitive, comitative, vocative and causal.

Table 4.2 Post positions for Telugu sentence order

Telugu / English / Significance / Usual Suffixes / Transliteration of Suffixes
Panchami Vibhakti (పంచమీ విభక్తి) / Ablative of motion from / Motion from an animate/inanimate object / వలనన్, కంటెన్, పట్టి / valanan, kaMTen, paTTi
Dviteeya Vibhakti (ద్వితీయా విభక్తి) / Accusative / Object of action / నిన్, నున్, లన్, కూర్చి, గురించి / nin, nun, lan, kUrci, guriMci
Chaturthi Vibhakti (చతుర్థి విభక్తి) / Dative / Object to whom action is performed, Object for whom action is performed / కొఱకున్, కై / korakun, kai
Shashthi Vibhakti (షష్ఠీ విభక్తి) / Genitive / Possessive / కిన్, కున్, యొక్క, లోన్, లోపలన్ / kin, kun, yokka, lOn, lOpalan
Truteeya Vibhakti (తృతీయా విభక్తి) / Instrumental, Social / Means by which action is done (Instrumental), Association, or means by which action is done (Social) / చేతన్, చేన్, తోడన్, తోన్ / cEtan, cEn, tODan, tOn
Saptami Vibhakti (సప్తమీ విభక్తి) / Locative / Place in which, On the person of (animate) in the presence of / అందున్, నన్ / aMdun, nan
Prathama Vibhakti (ప్రథమా విభక్తి) / Nominative / Subject of sentence / డు, ము, వు, లు / Du, mu, vu, lu
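As an illustration of how the suffixes in Table 4.2 could be used, the following sketch identifies the case of a transliterated word by longest-suffix matching. The suffix lists are copied from the table; the function name `identify_case`, the longest-match strategy and the sample word forms are assumptions made for the example, and a real analyzer would also have to handle sandhi and ambiguous short suffixes.

```python
# Transliterated suffixes copied from Table 4.2.
CASE_SUFFIXES = {
    "Ablative": ["valanan", "kaMTen", "paTTi"],
    "Accusative": ["nin", "nun", "lan", "kUrci", "guriMci"],
    "Dative": ["korakun", "kai"],
    "Genitive": ["kin", "kun", "yokka", "lOn", "lOpalan"],
    "Instrumental/Social": ["cEtan", "cEn", "tODan", "tOn"],
    "Locative": ["aMdun", "nan"],
    "Nominative": ["Du", "mu", "vu", "lu"],
}

# Longest suffixes first, so "korakun" (dative) wins over "kun" (genitive).
SUFFIX_TO_CASE = sorted(
    ((suffix, case) for case, suffixes in CASE_SUFFIXES.items() for suffix in suffixes),
    key=lambda item: len(item[0]),
    reverse=True,
)


def identify_case(word):
    """Return (stem, suffix, case); suffix and case are None when nothing matches."""
    for suffix, case in SUFFIX_TO_CASE:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)], suffix, case
    return word, None, None


# Hypothetical transliterated word forms, used only to exercise the lookup.
print(identify_case("rAmuDu"))
print(identify_case("rAmukorakun"))
```

Matching is case-sensitive because the transliteration scheme uses capital letters to distinguish sounds (for example "Du" versus "du").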

A verb in a Telugu sentence is either finite or non-finite, and it occurs according to situations such as rising pitch, question meaning, level pitch, falling pitch and command meaning. In Telugu, all verbs have finite and non-finite forms.

A finite form can stand as the main verb of a sentence and occur before a final pause (full stop), whereas a non-finite form cannot stand as a main verb and rarely occurs before a final pause. There are eight finite verb rules for Telugu, arranged in three verbal structures: stem or inflection root, tense-mode suffix and personal suffix. Table 4.3 presents these rules for the verb “ఆట్లాడు” (playing) with the root word “ఆట్ల” (play).

Table 4.3 Finite verb rules

Structure / Rule / Suffix / Example
Inflection or stem root / (Rule 1) Imperative / Singular –du; Plural –andi / atla –du (singular); atla –andi (plural)
Tense-mode suffix / (Rule 2) Admonitive or abusive / kAlu (to burn), kUlu (to fall), pagulu (to break) / In this case, due to semantic restrictions, many verbs cannot occur in this mode
Tense-mode suffix / (Rule 3) Obligative (in all persons) / -Ali / atlad –Ali (I, We, You) (singular, plural)
Personal suffix(es) / (Rule 4) Habitual-future or non-past / -ta- / atla – ta – Am (we shall play); atla – ta – Adu (he shall play); atla – tun – di (she will play); atla – ta – Anu (I shall play); atla – ta – Ava (you will play); atla – ta – Ay (they play); atla – ta – Aru (they will play)
Personal suffix(es) / (Rule 5) Past tense / -din- / atla – din – Anu (I played); atla – din – Ava (you played (singular)); atla – din – Aru (you played (plural)); atla – din – Am (we played); atla – din – Adu (he played); atla – din – di (she/it played); atla – din – Aru (they played)
Personal suffix(es) / (Rule 6) Hortative / -da- / atla – da – tAm (let us play, or we shall play)
Personal suffix(es) / (Rule 7) Negative tense / -data- / atla – data – va (you (do, did, and shall) not play); atla – data – Du (he (does, did, and shall) not play); atla – data – nu (I (do, did, and shall) not play); atla – data – m (we (do, did, and shall) not play); atla – data – ru (they (do, did, and shall) not play); atla – data – du (she/it (does, did, and shall) not play)
Personal suffix(es) / (Rule 8) Negative imperative or prohibitive / -Ak- / atla – Ak – andi (you (plural) don’t play); atla – Ak – u (you (singular) don’t play)
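The three-part structure of a finite verb (root, tense-mode suffix, personal suffix) lends itself to a simple segmentation routine. The sketch below covers only the markers of Rules 4 and 5 in Table 4.3 and works on the concatenated segments exactly as the table writes them, without any sandhi; the function name `analyze_finite` and the person labels are assumptions made for illustration.

```python
# Markers transcribed from Rules 4 and 5 of Table 4.3.
TENSE_MARKERS = {"ta": "habitual-future", "din": "past"}
PERSON_MARKERS = {
    "Anu": "first person singular",
    "Am": "first person plural",
    "Ava": "second person singular",
    "Adu": "third person singular masculine",
    "Aru": "second or third person plural",
}


def analyze_finite(form, root="atla"):
    """Split a transliterated finite verb into root, tense marker and person marker."""
    if not form.startswith(root):
        return None
    rest = form[len(root):]
    for marker, tense in TENSE_MARKERS.items():
        if rest.startswith(marker):
            person = PERSON_MARKERS.get(rest[len(marker):])
            if person:
                return {"root": root, "tense": tense, "person": person}
    return None  # unknown tense or person marker


print(analyze_finite("atladinAnu"))
print(analyze_finite("atlataAm"))
```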

In the same way, there are ten non-finite verb rules, which may be arranged into two structural types, unbound and bound. These rules are shown in Table 4.4.

Table 4.4 Non-finite verb rules

Type / Rule / Suffix / Example
Bound type / (Rule 9) Present / -ta-un- / atladu- ta- unnAnu (I am playing); atladu - ta- un- nA (even playing (now)); atladu - ta- un- tE (if playing); atladu - ta- un- na (that playing)
Unbound type / (Rule 10) Concessive / -dinA / atla- dinA (even though played)
Unbound type / (Rule 11) Conditional / -itE / atla- itE (if played)
Unbound type / (Rule 12) Present participle / -dutu / atla- dutU (playing)
Unbound type / (Rule 13) Past participle / -di / atla- di (having played)
Unbound type / (Rule 14) Infinitive / -ta / atla –ta (to play)
Unbound type / (Rule 15) Past adjective / -dina / atla- dina (that played)
Unbound type / (Rule 16) Negative adjective / -dani / atla- dani (not played)
Unbound type / (Rule 17) Negative participle / -aku / atla- aku (not playing)
Unbound type / (Rule 18) Habitual adjective / -dE / atla- dE (that plays)
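The unbound non-finite rules can also be read in the generating direction: given a root, each rule attaches one suffix. The following sketch builds the nine unbound forms of Rules 10 to 18 by plain concatenation, using the suffixes as they appear in the example column of Table 4.4. The function name `non_finite_forms` is an assumption, and no sandhi or stem alternation is modelled.

```python
# Suffixes transcribed from the examples of Rules 10 to 18 in Table 4.4.
NON_FINITE_SUFFIXES = {
    "concessive": "dinA",
    "conditional": "itE",
    "present participle": "dutU",
    "past participle": "di",
    "infinitive": "ta",
    "past adjective": "dina",
    "negative adjective": "dani",
    "negative participle": "aku",
    "habitual adjective": "dE",
}


def non_finite_forms(root):
    """Return a mapping from rule name to the root with that rule's suffix attached."""
    return {rule: root + suffix for rule, suffix in NON_FINITE_SUFFIXES.items()}


for rule, form in non_finite_forms("atla").items():
    print(f"{rule}: {form}")
```

The transliteration is case-sensitive, so the concessive "dinA" and the past adjective "dina" remain distinct forms.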

The subject, object, verb and inflection are identified using the above grammar rules.

4.2.3 Bilingual Ontology

The terms are looked up in the ontology to find their English equivalents. The bilingual ontology for information retrieval is constructed from English-Telugu vocabulary relationships. In this research work, the ontology is a key element in both the pre-processing of the query and the post-processing of the results. A block diagram of the bilingual ontology component is shown in Figure 4.9.

Figure 4.9 Ontology Component

An ontology may take a variety of forms, but it necessarily includes a vocabulary of terms and some specification of their meaning. It includes definitions and an indication of how the concepts are inter-related, which collectively impose a structure on the domain and constrain the possible interpretations of the terms. Figure 4.10 illustrates the workflow of the bilingual ontology component in the pre-processing stage for CLIR, and Figure 4.11 shows the connecting relationships of the ontology terms.

Figure 4.10 Process flow of bilingual ontology component

Firstly, the English terms are mapped to Telugu terms that come from a Telugu-English bilingual dictionary. The English-Telugu ontology may therefore contain terms that do not appear in the original Telugu-English bilingual dictionary, or vice versa. The number of terms in both versions is compared, and the terms that do not appear in both languages are considered Out Of Vocabulary (OOV) terms. The result of the alignment is a term list, which is treated as the basis for extending the ontology. Each Telugu term in the list is considered a seed term and is used to search for Telugu synonyms online.
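The alignment step can be pictured as a comparison of two term sets. In the sketch below, terms present on both sides are aligned and terms present on only one side are reported as OOV. The function name `align_terms` and the sample entries are hypothetical, and comparing on English head terms is an assumption made for the example.

```python
def align_terms(dictionary, ontology_terms):
    """Compare dictionary head terms with ontology terms.

    Returns (aligned, oov): terms found on both sides, and terms found on one side only.
    """
    dictionary_terms = set(dictionary)
    ontology_terms = set(ontology_terms)
    aligned = sorted(dictionary_terms & ontology_terms)
    oov = sorted(dictionary_terms ^ ontology_terms)  # symmetric difference
    return aligned, oov


# Hypothetical sample data: English head terms mapped to Telugu terms.
dictionary = {"work": "పని", "play": "ఆట"}
ontology_terms = ["work", "school"]

aligned, oov = align_terms(dictionary, ontology_terms)
print("aligned:", aligned)
print("OOV:", oov)
```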

Secondly, a search engine is used to retrieve results in Telugu for each Telugu term; these results are assumed to contain candidate Telugu synonyms. Thirdly, Telugu translations of terms are extracted from the retrieved results by sequentially applying the following: a) linguistic rules, which provide the text segments potentially containing translations; b) mutual information filtering, which refines the candidate translations. Fourthly, the frequencies of each English term and Telugu translation in the results retrieved by the search engine are calculated, and term weights are computed from these frequencies.
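The text does not give the exact formulas for the mutual information filter or the term weights, so the following sketch rests on two stated assumptions: the filter uses standard pointwise mutual information (PMI) computed from co-occurrence counts, and the weights are simple relative frequencies. All function names, the threshold and the count layout are illustrative.

```python
import math


def pmi(pair_count, term_count, candidate_count, total):
    """Pointwise mutual information (base 2) computed from raw counts."""
    return math.log2((pair_count * total) / (term_count * candidate_count))


def filter_candidates(term_count, candidates, total, threshold=0.0):
    """Keep the candidates whose PMI with the seed term exceeds the threshold.

    candidates maps each candidate to (count together with the term, overall count).
    """
    kept = {}
    for candidate, (pair_count, candidate_count) in candidates.items():
        if pair_count == 0:
            continue  # never co-occurs, so PMI is undefined
        score = pmi(pair_count, term_count, candidate_count, total)
        if score > threshold:
            kept[candidate] = score
    return kept


def term_weights(frequencies):
    """Relative-frequency weights that sum to 1 (an assumed weighting scheme)."""
    total = sum(frequencies.values())
    return {term: count / total for term, count in frequencies.items()}


# Hypothetical counts over 32 retrieved snippets.
print(filter_candidates(16, {"a": (16, 16), "b": (4, 16)}, total=32))
print(term_weights({"a": 3, "b": 1}))
```

With these sample counts, candidate "a" always co-occurs with the seed term and is kept, while candidate "b" co-occurs less often than chance would predict and is filtered out.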

Figure 4.11 Ontology Relationship Hierarchies

Finally, the aligned term pairs, the English translations, the term weights and the ontology entry terms are merged according to the ontology hierarchy, forming the Telugu-English bilingual ontology. Figure 4.12 shows the order in which the suggestions are displayed: the meaning, the relationship terms and the related terms are expanded in that order and shown to the user.
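One way to picture a merged ontology entry and the display order of its suggestions is the small data structure below. The class name `OntologyEntry`, its field names and the sample values are hypothetical; only the ordering (meaning, then relationship terms, then related terms) is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    telugu_term: str
    meaning: str  # English equivalent of the Telugu term
    weight: float = 0.0
    relationship_terms: list = field(default_factory=list)
    related_terms: list = field(default_factory=list)

    def suggestions(self):
        """Return suggestions in display order: meaning, relationship terms, related terms."""
        return [self.meaning, *self.relationship_terms, *self.related_terms]


# Hypothetical entry used only to show the ordering.
entry = OntologyEntry("పని", "work", 0.6, ["job"], ["office"])
print(entry.suggestions())
```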