Preprocessing Arabic Text For Natural Language Processing Applications

A Rule-Based Morphological Analyzer of Arabic Words

ARAFAT AWAJAN

Computer Science Department

Princess Sumaya University College for Technology

Royal Scientific Society

Amman - JORDAN

Abstract: - This paper describes a rule-based technique for analyzing the morphology of Arabic words. The proposed ‘Morphological Analyzer’ processes the input word in order to determine its lexical form. The lexical form of the majority of Arabic words consists of a root and a morphological pattern. The analyzer applies a set of predefined rules in order to analyze the morphology of Arabic words as they appear in real text. It is able to recognize diacriticized, undiacriticized or partially diacriticized Arabic words generated from N-letter roots. In order to determine the possible meanings of a word, the Morphological Analyzer also provides some useful attributes of the word such as its type, gender, tense and number. The proposed Morphological Analyzer is a general-purpose technique that can be integrated into larger scale systems such as automatic translation applications, text summarization applications, text correction applications, web search engines, automatic vowelization of Arabic text applications and other natural language processing applications.

Keywords: - Natural Language Processing, Arabic Word Recognition, Lexical Form, Roots, Morphological Patterns, Morphological Analyzer.

1. Introduction

For the Arabic language, as for many other languages, the morphological features of a word provide crucial information for understanding text and extracting information from it. In fact, the possible meanings of an individual word depend mainly on its morphology and its position in the sentence. Therefore, the possible meanings of each word must be determined first in order to understand text written in a natural language.

A number of research papers concerning the morphological analysis of words have been published for various natural languages, particularly the European languages [2] [5] [8] [6] [9]. Descriptions of real systems for analyzing the morphology of these languages are also available [7]. These works show that the complexity of the morphological analysis of words varies from one natural language to another. There have been fewer articles and published research papers written on the subject of the morphological analysis of Arabic words. Most of these published works ignore the presence of diacritics in the Arabic text or limit the analysis to words generated from 3-letter roots [1] [4] [3].

The rule-based Morphological Analyzer presented in this paper aims to find the lexical form and the possible meanings of each word in a text written in the Arabic language. The proposed analyzer is being developed to analyze Arabic words as they appear in real text. It can be applied to diacriticized, undiacriticized or partially diacriticized Arabic words. Furthermore, it allows the morphological analysis of words generated from roots of variable length.

Our approach is based on the use of the specific features and structures that the Arabic language uses for generating words. It applies a set of predefined rules specific to the Arabic language in order to extract the lexical structure of the word which generally consists of a root and a morphological pattern. A classical lexicon is used to verify the correctness of the analyzed word and to determine the meanings it could take.

The morphological information that our technique is able to extract gives vital support to the different fields and applications of Natural Language Processing. The purpose of the Morphological Analyzer is to preprocess Arabic text in order to prepare it for some automated treatment such as human-machine interaction, translation, text summarization, text correction, automatic vowelization of Arabic text, web search engines and other applications of natural language processing.

2. The Morphology of Arabic Words

In many European languages, words are constructed from basic units called morphemes by adding suffixes and prefixes. A morpheme is the primitive unit of meaning in a language. For example, the meaning of the English word ‘friendly’ is derivable from the meaning of the noun ‘friend’ and the suffix ‘-ly’, which transforms a noun into an adjective [3]. In such cases, morphological analysis is based on the elimination of affixes and the extraction of the basic morpheme of a word. Special treatment is always needed to deal with the irregularities present in almost all natural languages.

The morphology of the Arabic language is based on the Semitic root-and-pattern scheme of forming words. The majority of words are therefore generated from basic entities called roots or radicals according to a predefined list of patterns called morphological balances or patterns [1] [4] [3]. The roots consist mainly of 3 letters, although 4-letter and 5-letter roots also exist. The morphological patterns represent the major spelling rules of Arabic words. This mechanism of Arabic word generation is called ‘AL-ISHTIQAQ’. It is performed by adding letters and/or diacritical marks to the roots. These additional letters and diacritical marks may be added at the beginning, in the middle or at the end of the root. In this paper, a morphological pattern is represented by the additional parts, their positions and the slots where the letters of a root can be inserted. The character ‘*’ represents the slots of the root’s letters. Figure 1 illustrates the ‘AL-ISHTIQAQ’ mechanism: it presents words generated from the same root ‘K T B’ according to different morphological patterns. It is important to note the role that diacritics play in fixing the meaning of the first and second words of Figure 1.

Figure 1. AL-ISHTIQAQ Mechanism

All classes of words (verbs, nouns, adjectives and adverbs) can be generated from roots according to the appropriate patterns. The pattern used for generating a word determines its various attributes, such as gender (masculine/feminine), number (singular/plural), tense (past, present and imperative), mode, etc. Figure 2 presents an example that shows the importance of the standard Arabic morphological patterns in fixing the meaning of a word.

Based on the above, an Arabic word can be represented lexically by its root, along with its morphological pattern. The latter is one element of a countable set of limited size. A pattern is defined by a set of additive letters and/or a set of diacritical marks and their positions in the generated word.

3. Arabic Language Features and Challenges

The formation of Arabic words presents specific features and challenges that must be taken into consideration when fixing the rules used by the morphological analyzer. The first challenge is that some letters of the root may be dropped or modified during the generation of words from roots. The analyzer has to rebuild the original root-letters by retrieving the missing or modified letters of the word.

The second challenge is the presence of eight different types of diacritical marks, used to represent short vowels. In written text they are considered as special letters where each one is assigned a single code, as with normal letters. In fully diacriticized text a diacritical mark is added after each consonant of the word. These diacritical marks play a very important role in fixing the meaning of words. In fact, two different patterns may have the same sequence of consonants, but one is distinguished from the other solely by the diacritical marks. These marks are classified into the following categories: [1]

· Three diacritical marks to indicate the short vowels ( َ ُ ِ ),

· Three double diacritical marks which combine the single ones ( ً ٌ ٍ ),

· A single diacritical mark to indicate the absence of vowelization ( ْ ),

· A single diacritical mark to indicate the duplicate occurrence of a consonant ( ّ ).
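The four categories can be written down as a small lookup table. The following is a minimal sketch, assuming the marks are encoded with their Unicode code points (U+064B to U+0652); the category names and the helper `diacritic_category` are illustrative labels and not part of the analyzer itself.

```python
# Illustrative lookup table for the eight diacritical marks.
# The category names are assumptions made for this sketch.
DIACRITIC_CATEGORIES = {
    "short_vowel": "\u064E\u064F\u0650",  # fatha, damma, kasra
    "double": "\u064B\u064C\u064D",       # fathatan, dammatan, kasratan
    "no_vowel": "\u0652",                 # sukun
    "doubling": "\u0651",                 # shadda
}


def diacritic_category(ch):
    """Return the category of a single diacritical mark, or None for any other character."""
    for name, marks in DIACRITIC_CATEGORIES.items():
        if ch in marks:
            return name
    return None
```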

According to the extent to which diacritics are used, Arabic text may be classified into three categories: undiacriticized, partially diacriticized, and fully diacriticized text. The first category represents text without diacritics, such as typed or printed text and newspapers. The second category represents partially diacriticized text, where diacritical marks are added to eliminate the ambiguity of some words. The last category represents fully diacriticized Arabic text, in which every consonant is followed by a diacritical mark. This format is used for writing the Holy Koran, classic Arabic literature and children’s educational books.
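The three categories can be told apart by counting how many consonants carry a mark. The sketch below assumes the definition given above (a word is fully diacriticized when every consonant is followed by at least one mark) and treats the elongation stroke (tatweel, U+0640) as typographic filler; the function name `diacritization_level` is hypothetical.

```python
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")
TATWEEL = "\u0640"  # elongation stroke, ignored in this sketch


def diacritization_level(word):
    """Classify a word as undiacriticized, partially or fully diacriticized."""
    letters = 0   # consonants seen so far
    marked = 0    # consonants followed by at least one diacritic
    previous_was_letter = False
    for ch in word:
        if ch == TATWEEL:
            continue
        if ch in DIACRITICS:
            if previous_was_letter:
                marked += 1          # count each consonant once, even with two marks
            previous_was_letter = False
        else:
            letters += 1
            previous_was_letter = True
    if marked == 0:
        return "undiacriticized"
    if marked == letters:
        return "fully diacriticized"
    return "partially diacriticized"
```

Applied to a whole text, the same count could be aggregated over all words; that aggregation is left out here.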

The third challenge is that not all the words in Arabic text are generated from a root. For example, some words, such as the tools and foreign words, cannot be broken down into a root and a pattern. As the number of tools is limited, a table of these predefined tools can be used to check whether a word is a tool before sending it to the analyzer. The ‘loan’ or foreign words, meanwhile, are listed in the lexicon and need not undergo morphological analysis.
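This pre-check amounts to two table lookups before the analyzer is called. The sketch below is only an illustration: the tables `TOOLS` and `LOAN_WORDS` hold a few sample entries chosen for the example, the function `route_word` is hypothetical, and the input is assumed to be undiacriticized.

```python
# Hypothetical sample tables; a real system would load the complete list of
# tools and the loan-word entries of its lexicon.
TOOLS = {"في", "من", "على", "إلى"}
LOAN_WORDS = {"تلفزيون", "كمبيوتر"}


def route_word(word):
    """Decide how a word is handled: 'tool', 'lexicon' or 'analyzer'."""
    if word in TOOLS:
        return "tool"        # found in the table of predefined tools
    if word in LOAN_WORDS:
        return "lexicon"     # loan word: looked up directly, not analyzed
    return "analyzer"        # everything else goes to morphological analysis
```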

Figure 2. Role of the Morphological Pattern of an Arabic Word in Fixing its Meaning

4. The Morphological Analyzer

The Morphological Analyzer of Arabic words (MAAW) processes each word of the input text in order to determine its root and pattern. The results of the morphological analyzer can be used for further analysis. Figure 3 presents these transformations schematically.

The identification of the morphological structure of a word depends on a rule-based system that can find the morphological pattern of diacriticized or undiacriticized words. To this end, we assume that a diacritic follows each letter of the word. If a diacritic is omitted, it is replaced by a special character (EXTRA-SECOUN) that we introduce to stand for the absent diacritic. EXTRA-SECOUN is denoted by a dot in the examples of this paper.

A procedure ‘Check_Diacritics’ takes the list of characters forming the word and checks for the presence of a diacritic after each consonant. Wherever a consonant is not followed by a diacritical mark, it inserts the mark EXTRA-SECOUN. A word is then represented by a list of characters L in the following format:

[C1 V1 C2 V2 . . . Cn Vn]

where Ci is a consonant and Vi is a diacritical mark. Each of the classical patterns is also represented by a list of the same structure, in which the slots of a root’s letters are marked by the character ‘*’. Figure 4 shows an example of a classical pattern representation.
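The normalization performed by Check_Diacritics can be sketched as follows. This is an illustrative reading of the description above, not the actual procedure: it assumes Unicode input, drops the elongation stroke (tatweel), and, when a consonant carries two marks (for example shadda plus a vowel), keeps both in the same slot so that the list stays strictly alternating.

```python
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")
TATWEEL = "\u0640"
EXTRA_SECOUN = "."  # placeholder for an omitted diacritic, shown as a dot


def check_diacritics(word):
    """Return the alternating list [C1, V1, C2, V2, ..., Cn, Vn] for a word."""
    result = []
    for ch in word:
        if ch == TATWEEL:
            continue                           # typographic filler only
        if ch in DIACRITICS:
            if result and result[-1] == EXTRA_SECOUN:
                result[-1] = ch                # first mark fills the empty slot
            elif result:
                result[-1] += ch               # second mark shares the slot (assumption)
            # a mark that precedes any consonant is ignored
        else:
            result.extend([ch, EXTRA_SECOUN])  # consonant plus an empty slot
    return result
```

For an undiacriticized word, every V slot in the returned list holds the dot.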

To deal with the three possible situations of Arabic text (fully diacriticized, partially diacriticized and undiacriticized text), the list L is further divided into two new lists. The first list LC contains the sequence of consonants [C1, C2, . . . Cn] and the second list LV contains the diacritical characters [V1, V2, . . . Vn]. Table 1 shows examples of the segmentation of words into consonants and diacritics. The three examples given in Table 1 share the same list of consonants LC.

Figure 3. Morphological Analyzer (diagram stages: Original Text, Morphological Features, Further Analysis (NLP Applications))

Figure 4. Pattern Representation

Word / Word Class / List of Consonants / List of Diacritics
يَـذْهَـبُـوْنَ / Fully diacriticized / [ ي ذ هـ ب و ن ] / [ َ ْ َ ُ ْ َ ]
يَـذهَـبـونَ / Partially diacriticized / [ ي ذ هـ ب و ن ] / [ َ . َ . . َ ]
يـذهـبـون / Undiacriticized / [ ي ذ هـ ب و ن ] / [ . . . . . . ]

Table 1. Decomposition of Words into a List of Consonants and a List of Diacritics
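Because the list L strictly alternates consonants and diacritics, the split into LC and LV reduces to taking the even and odd positions. The sketch below illustrates this on the fully diacriticized word of Table 1; the function name `split_word_list` is hypothetical.

```python
FATHA, DAMMA, SUKUN = "\u064E", "\u064F", "\u0652"


def split_word_list(l):
    """Split [C1, V1, ..., Cn, Vn] into the pair (LC, LV)."""
    if len(l) % 2:
        raise ValueError("expected alternating consonant/diacritic entries")
    return l[0::2], l[1::2]


# Fully diacriticized word of Table 1, already in the [C1 V1 ... Cn Vn] format.
L = ["ي", FATHA, "ذ", SUKUN, "ه", FATHA, "ب", DAMMA, "و", SUKUN, "ن", FATHA]
LC, LV = split_word_list(L)
```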

The list of consonants (LC) represents the letters of the word’s root, and the suffixes, infixes and prefixes used to form the word according to a given pattern. In order to extract the root of a word, the list LC can be represented by the following general description:

[X1[X2[X3]]] R1 [Y1] R2 [Y2] R3 [ [Y3] R4 [[Y4] R5]] [Z1[Z2[Z3]] ]

where the components X1X2X3 represent a prefix of at most three letters, the components Z1Z2Z3 represent a postfix of at most three letters, and the components Y1Y2Y3Y4 represent the possible infixes, at most four letters. The slots R1, R2, R3, R4 and R5 represent the letters of the root used to generate the word. The characters [ ] indicate that the enclosed component is optional. This representation allows us to handle all kinds of roots (3-letter, 4-letter and 5-letter roots). Table 2 gives examples of the above representation. The first two words are generated from two different three-letter roots according to the same morphological pattern; they share the same additive parts (prefix, infix and postfix). The last three words are generated from the same root according to different patterns.
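To make the general description concrete, the sketch below enumerates every way a consonant list can be cut into prefix, root, infixes and postfix under that template, and keeps only the cuts whose root appears in a root lexicon. It is a brute-force illustration, not the rule set of the analyzer: the function `decompose` and the tiny `ROOTS` set are hypothetical, and no check is made that the affix letters are legal additive letters.

```python
from itertools import product

ROOTS = {"ذهب", "درس"}  # hypothetical two-entry root lexicon for the example


def decompose(lc, roots, max_affix=3):
    """Return all (prefix, root, infix, postfix) cuts of `lc` that fit the
    template and whose root is listed in `roots`."""
    found = []
    n = len(lc)
    for p in range(min(max_affix, n) + 1):            # prefix of 0 to 3 letters
        for s in range(min(max_affix, n - p) + 1):    # postfix of 0 to 3 letters
            stem = lc[p:n - s]
            for root_len in (3, 4, 5):
                # gaps[k] == 1 means one infix letter follows the (k+1)-th root letter
                for gaps in product((0, 1), repeat=root_len - 1):
                    if root_len + sum(gaps) != len(stem):
                        continue
                    root, infix, i = [], [], 0
                    for k in range(root_len):
                        root.append(stem[i])
                        i += 1
                        if k < root_len - 1 and gaps[k]:
                            infix.append(stem[i])
                            i += 1
                    if "".join(root) in roots:
                        found.append((lc[:p], root, infix, lc[n - s:]))
    return found
```

For example, `decompose(list("سيذهبون"), ROOTS)` yields a single cut with a two-letter prefix, the root ذ ه ب, no infix and a two-letter postfix. With a realistic lexicon several cuts may survive, which is where the pattern rules and the diacritics would be needed to choose among them.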

The morphological patterns are also segmented into two lists, LC and LV. For example, the pattern presented above in Figure 4 can be broken down into a list of consonants (LC) and a list of diacritical marks (LV) (Figure 5). The separation of consonants and diacritics significantly reduces the number of patterns to be tested.
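One way to read this two-stage test is sketched below: a word is first compared with a pattern on consonants only, and the diacritics are compared afterwards, with an omitted diacritic accepted against any pattern diacritic. The pattern used here is an illustrative one built from the fully diacriticized word of Table 1, not the pattern of Figure 4, and the function names are hypothetical.

```python
SLOT = "*"          # slot for a root letter in a pattern
EXTRA_SECOUN = "."  # stands for an omitted diacritic
FATHA, DAMMA, SUKUN = "\u064E", "\u064F", "\u0652"

# Illustrative pattern: three root slots between a one-letter prefix and a
# two-letter postfix, with one diacritic per position.
PATTERN_LC = ["ي", SLOT, SLOT, SLOT, "و", "ن"]
PATTERN_LV = [FATHA, SUKUN, FATHA, DAMMA, SUKUN, FATHA]


def match_consonants(word_lc, pattern_lc):
    """Return the root letters if word_lc fits pattern_lc, otherwise None."""
    if len(word_lc) != len(pattern_lc):
        return None
    root = []
    for c, p in zip(word_lc, pattern_lc):
        if p == SLOT:
            root.append(c)       # letter falls into a root slot
        elif p != c:
            return None          # additive letter does not match
    return root


def diacritics_compatible(word_lv, pattern_lv):
    """True if every diacritic present in the word agrees with the pattern."""
    return len(word_lv) == len(pattern_lv) and all(
        w == EXTRA_SECOUN or w == p for w, p in zip(word_lv, pattern_lv)
    )
```

Under this reading, an undiacriticized word is filtered by the consonant test alone, while each diacritic that is present can only narrow the set of surviving patterns.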

Input Word / List of Consonants / Root R1R2R3 / Prefix X1X2X3 / Infix Y1Y2 / Postfix Z1Z2Z3
سيـذهبـون / [ س ي ذ هـ ب و ن ] / [ ذ هـ ب ] / [ س ي ] / [ ] / [ و ن ]
سيـدرسـون / [ س ي د ر س و ن ] / [ د ر س ] / [ س ي ] / [ ] / [ و ن ]
دارسـون / [ د ا ر س و ن ] / [ د ر س ] / [ ] / [ ا ] / [ و ن ]
مـدرسـون / [ م د ر س و ن ] / [ د ر س ] / [ م ] / [ ا ي ] / [ ]

Table 2. Decomposition of the List of Consonants