INVESTIGATING SIMILARITIES IN WORDCOUNT FOR ENGLISH AND MALAYALAM LANGUAGE BY USING MULTILINGUAL WIKIPEDIA

A study submitted in partial fulfillment

of the requirements for the degree of

Master of Science in Information Systems

at

THE UNIVERSITY OF SHEFFIELD

by

JIMMY K. JOSE

September 2009

Abstract

Background: Wikipedia is a free, open and collaborative multilingual encyclopedia with an enormous number of excellent articles. The rapid growth of Wikipedia is creating a parallel and comparable multilingual corpus on an unprecedented scale (Adar, Skinner, Weld, 2009). Currently, there are 266 languages in Wikipedia, and Malayalam is one of them. Wikipedia entries vary in their level of information because they are developed independently. Are the articles in Malayalam Wikipedia as informative as those in English Wikipedia? Can these articles be used for Natural Language Processing applications as a parallel corpus or a comparable corpus? Can these articles be analyzed to generate a Machine Translation evaluation metric for Malayalam to English?

Aims: The purpose of this study is to analyze and quantify the asymmetries between Malayalam and English Wikipedia in order to come up with a ratio of word count between these two languages.

Methods: A framework to quantify the asymmetries and information overlap between the multilingual Wikipedias needs to be created. Both quantitative and qualitative data are collected by visiting each article in Malayalam Wikipedia and its corresponding English page. The hypothesis under test is that the ratio of word count is the same for a comparable corpus and a parallel corpus. Similar articles in English and Malayalam are translated manually and the ratio is also calculated for these outputs; a dependent t-test is then conducted to test the hypothesis.

Results: The ratio of the word count of an English article to that of a Malayalam article in multilingual Wikipedia is 1.6 if the information overlap is 100%. The study also showed the hypothesis regarding the equality of word count in a comparable corpus and a parallel corpus to be false.

Conclusions: The parameters that directly affect the word count of a sample test file affect the study. These factors would also influence the outcome of the hypothesis test. An efficient automated way to calculate the word count is very important for future study. An automatic translation system for the language pair would also ease the study and improve the quality of the results.

Acknowledgements

I would like to thank my supervisor Dr. Paul Clough for his guidance throughout my Masters dissertation.

Table of Contents

1. Introduction

1.1 Aim

1.2 Objectives

2. Literature Review

2.1 History of Wikipedia

2.2 Multilingual Wikipedia and Indian Languages

2.3 Malayalam Language, Origin and dialects

2.4 Wikipedia in Malayalam

2.5 Multilingual Wikipedia and NLP

2.6 Machine Translation

2.7 Machine Translation Evaluation Methods

3.0 Methodology

3.1 Research Design

3.2 Data Collection

3.2.1 Sampling methods

3.2.2 Dependent t-test:

3.2.3 Design of Scale

3.2.4 Structure of Wikipedia Pages

3.2.5 Structure of Data Sheet

3.2.6 Data collection method 1 (Informal)

3.2.7 Data collection method 2 (Census)

3.2.8 Data collection method 3 (Formal)

3.2.9 Data collection method 4 (Manual)

3.3 Data Analysis

3.3.1 Analysis of alphabetical distribution of data.

3.3.2 Analysis of data across the scale in alphabetical order.

3.3.3 Analysis of Ratio.

3.3.4 Analysis and identification of category

3.3.5 Analysis of data across category with respect to scale

3.3.6 Analysis of data using SPSS.

4.0 Results

4.1 Alphabetical distribution of data in Malayalam Wikipedia.

4.2 Multilingual distribution of data

4.3 Distribution of pages according to category

4.4 Category wise multilingual distribution of pages

4.5 t-test result for parallel and comparable translation

5.0 Discussion

5.1 Summary and discussion regarding multilingual data distribution based on alphabets

5.2 Summary and discussion regarding multilingual data distribution based on category

5.3 Discussion about the multilingual ratio and t-test outcome

6.0 Conclusion and future works

Bibliography

Web sites

Appendix A – Data collection method 1 (Informal)

Appendix B - Data collection method 2 (Census)

Appendix C – Structure of CD

1. Introduction

The Oxford English Dictionary (Ostler, 1969) describes an encyclopedia as a book of information on all branches of knowledge. The growth of technology and the internet has transformed physical libraries into digital form, thereby allowing information to be accessed beyond physical boundaries.

Wikipedia is an online, open, free multilingual encyclopedia created by the collaborative effort of people for the people. “Success of Wikipedia results from a wisdom of crowds, it was driven by the influence of elite users in its early stage, but now the common users drives it.” (Kittur, Pendleton, Chi, Suh, Mytkowicz, 2006). Critics like Keen (2008) believe the information in Wikipedia is not trustworthy since “the distinction between trained expert and uninformed amateur becomes dangerously blurred”. Researchers like Filatova (2009) believe that Wikipedia itself has the resources to increase its trustworthiness and that it contains reliable information drawn from the wisdom of independent users. Despite Keen’s arguments about the trustworthiness of Wikipedia, people use it to perform cross-language information retrieval, cross-language QA and summarization for biography generation (Biadsy, Hirschberg, Filatova, 2008).

An ideal multilingual encyclopedia is supposed to have the same information across all its languages, but Wikipedia differs in this respect. Articles in Wikipedia are created independently by different authors in different languages. Wikipedia has nevertheless gained popularity because it offers at least something on many topics (Sanger, 2005). According to Clough and Eleta (2009), “Language represents a clear barrier to accessing information online”. Cross-lingual summarization (Adar, Skinner, Weld, 2009) would improve not only the information across multilingual Wikipedia but also the trustworthiness of information across different regions. Language barriers could be reduced by improvements in cross-language information retrieval, cross-language summarization, cross-language QA and cross-language summarization of biographies.

1.1 Aim

The aim of this study is to analyze and quantify the asymmetries between Malayalam and English Wikipedia in order to come up with a ratio of word count between these two languages. Filatova (2009) says, “These asymmetries can be used by NLP researchers for training summarization systems, and contradiction detection systems”. The outcome of this study can be used to refine the information arbitrage across Malayalam and English Wikipedia.

1.2 Objectives

The following objectives make it possible to achieve the aim and to draw conclusions about the multilingual nature of Malayalam and English Wikipedia.

  • Create a framework to quantify the asymmetries and information overlap between the Malayalam and English Wikipedia.
  • Quantify and analyze the information arbitrage and information overlap across Malayalam and English Wikipedia.
  • Quantify and analyze the alphabetical frequency distribution of articles in the Malayalam Wikipedia.
  • Classify and analyze the categorical distribution of articles in Malayalam Wikipedia.
  • Test the hypothesis that the ratio of word count is the same for a parallel corpus and a comparable corpus.
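The last objective can be illustrated with a minimal Python sketch. It is not the procedure used in this study (the analysis is carried out in SPSS, see section 3.3.6); it assumes that a word is any whitespace-separated token, and the function names `word_count`, `word_count_ratio` and `paired_t_statistic` are hypothetical names chosen for illustration. The t statistic is the standard dependent (paired) form: the mean of the pairwise differences divided by its standard error.

```python
import math
from statistics import mean, stdev


def word_count(text):
    """Count whitespace-separated tokens (an assumed, simplistic definition of a word)."""
    return len(text.split())


def word_count_ratio(english_text, malayalam_text):
    """Ratio of the English word count to the Malayalam word count for one article pair."""
    return word_count(english_text) / word_count(malayalam_text)


def paired_t_statistic(sample_a, sample_b):
    """Return the dependent (paired) t statistic and its degrees of freedom."""
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("need two samples of equal length with at least two pairs")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    sd = stdev(diffs)  # sample standard deviation of the pairwise differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Here `sample_a` would hold the ratios measured on the comparable (Wikipedia) article pairs and `sample_b` the ratios measured on the manually translated (parallel) versions of the same articles. The resulting t value would still have to be compared with a critical value for the given degrees of freedom.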

2. Literature Review

2.1 History of Wikipedia

Wikipedia is an open, free and collaborative encyclopedia with an enormous number of excellent articles, built using Wiki technology. “Wiki technology was introduced in 1995 by Ward Cunningham as a programming language pattern, allowing people to change the data which they are viewing” (Leuf, Cunningham, 2001; Almeida, Mozafari, Cho, 2007).

Nupedia, the predecessor of Wikipedia, was a highly reliable peer-reviewed website built through the efforts of subject area experts and the public. In order to boost the growth of Nupedia, Wikipedia was created as a feeder on January 15, 2001 by Jimmy Wales and Larry Sanger. According to Sanger (2005), “Wikipedia is getting high profile, international recognition as a new way of obtaining at least a rough and ready idea about many topics”. Wikipedia grew faster than Nupedia because of features such as its open licence, its focus on being an encyclopaedia rather than a dictionary, ease of editing, neutrality and radical collaboration (Sanger, 2005).

Life of Wikipedia

2001: Wikipedia was formally launched on January 15, 2001 by Jimmy Wales and Larry Sanger. The number of articles in English grew to 16,442 by the end of the year. German and French Wikipedia were also introduced, thereby making Wikipedia multilingual.

2002: Wikipedia stopped commercial advertising and Wiktionary was launched as a sister project. The number of articles in Wikipedia grew rapidly to a hundred thousand by the end of the year.

2003: The Wikimedia Foundation was formed on June 20, 2003 by Jimmy Wales. The number of articles was nearing two hundred thousand by the end of the year.

2004: The number of articles doubled and spanned 100 languages. The English Wikipedia reached around five hundred thousand articles by the end of the year.

2005: Wikipedia launched multilingual subject portals. A new feature known as semi-protection was launched to review the biographies of living people. The number of articles reached a million by the end of the year.

2006: Wikipedia reached 1.5 million articles by the end of 2006 and its multilingual coverage doubled too.

2007: Wikipedia reached 7.5 million articles in 250 languages. English Wikipedia reached 2 million articles by the end of the year.

2008: Wikipedia reached 2.5 million articles by the end of the year.

2009: Wikipedia reached one billion articles in 271 languages and 3 million articles in English.

All the numbers mentioned above are based on Wikipedia Statistics.

2.2 Multilingual Wikipedia and Indian Languages

Miniwatts Marketing Group (2009) reports statistics on internet users by geographic location. About two fifths (41.2%) of current internet users are from Asia, even though the average penetration into the population there is just 17.4%. Figure 1.0 is a pie chart describing the distribution of internet users across the globe.

Figure 1.0 World Internet users

The number of users in Asia grew by 475% over the past 8 years (2000-2008). Countries like China, Japan and India dominate internet usage on the Asian continent. India has 81 million internet users, which amounts to 12% of internet users in Asia. According to the Census of India (2001), the total population of India is above one billion (1,028,737,436), so roughly 8% of the total Indian population would be internet users. According to the eighth schedule of The Constitution of India (2007), there are 22 scheduled languages. 17 of the 22 scheduled languages were present in Wikipedia as of 14th August 2009. Table 1.0 shows the 22 scheduled languages, the number of speakers in millions, the presence of a regional Wikipedia and the number of articles currently available. The number of articles is very small compared to the number of speakers because of the low internet penetration.

No / Language / Speakers in millions (Census 2001) / Number of Articles / Present in Wikipedia
1 / Assamese / 13 / 254 / Yes
2 / Bengali / 83 / 20135 / Yes
3 / Bodo / 1.2 / 0 / No
4 / Dogri / 0.1 / 0 / No
5 / Gujarati / 46 / 8078 / Yes
6 / Hindi / 422 / 36502 / Yes
7 / Kannada / 38 / 6965 / Yes
8 / Kashmiri / 5.5 / 371 / Yes
9 / Konkani / 2.5 / 0 / No
10 / Maithili / 12 / 0 / No
11 / Malayalam / 33 / 10701 / Yes
12 / Manipuri / 1.5 / 2341 / Yes
13 / Marathi / 72 / 24170 / Yes
14 / Nepali / 2.5 / 2755 / Yes
15 / Oriya / 33 / 551 / Yes
16 / Punjabi / 29 / 1408 / Yes
17 / Sanskrit / 0.05 / 3879 / Yes
18 / Santali / 6.5 / 0 / No
19 / Sindhi / 2.5 / 337 / Yes
20 / Tamil / 61 / 19092 / Yes
21 / Telugu / 74 / 43580 / Yes
22 / Urdu / 52 / 10820 / Yes

Table 1.0 Scheduled languages in India

2.3 Malayalam Language, Origin and dialects

Malayalam is the language of the people of Kerala, one of the states in India. It is one of the well developed languages classified under the Dravidian family of languages in south India (George, 1971). Malayalam has 33 million speakers in India according to the Census of India (2001). Around 2 million Malayalam speakers are in the Gulf countries and around a million in Malaysia, Singapore, Australia, the United States of America and the United Kingdom.

Malayalam script is derived from Grantha script, a descendant of the ancient Brahmi script (Philip, Samuel, 2009). Malayalam is written horizontally from left to right and its basic set of symbols consists of 14 vowels and 36 consonants, as shown in figure 2.0 and figure 3.0 respectively. The combination characters and extended characters together make a total of 592 characters.

Figure 2.0 Malayalam vowels

Figure 3.0 Malayalam consonants

Kerala has attracted foreign contact and invasion since ancient times because of its spices and its easy accessibility via the Arabian Sea. When one language is in contact with another for a long time, vocabulary is exchanged. According to George (1971), Malayalam has been fortunate enough to come into contact with many languages, such as Sanskrit, Persian, Arabic and Hindustani, and European languages such as English, Portuguese, Dutch and French. On a rough estimate, Malayalam has borrowed 150 words from Portuguese, 30 from French and 10 from Dutch. The number of words that Malayalam has borrowed from English is much larger because of the impact of Western life and letters on Malayalam over the past 200 years. Besides words, the punctuation methods, abbreviation techniques and word differentiation are also similar in English and Malayalam. Malayalam keeps evolving like any other language. Two styles of writing are in use in Malayalam:

  • old style of writing
  • new style of writing

The old style of writing used running letters and was somewhat complicated, since different words could be combined to form a single word. The new style of writing avoids combining words, thereby reducing the complexity and increasing the word count. These similarities between English and Malayalam helped to reduce the complexity of the study.

2.4 Wikipedia in Malayalam

Malayalam Wikipedia was the 86th ranking Wikipedia as of 14th August 2009, with more than 10,000 articles. Malayalam Wikipedia was started on 21st December, 2002. The quality of a Wikipedia is measured by its depth, which Wikipedia calculates using the following equation:

((Edits/Articles) × (Non-Articles/Articles) × (Stub-ratio))
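As a minimal sketch of how the depth formula above is evaluated, the following Python function multiplies the three factors exactly as written. The function name `wikipedia_depth` and its parameter names are hypothetical, and the stub ratio is taken as a ready-made input because the text does not define how it is derived.

```python
def wikipedia_depth(edits, articles, non_articles, stub_ratio):
    """Depth as (Edits/Articles) x (Non-Articles/Articles) x (Stub-ratio).

    All four inputs are assumed to come from a Wikipedia's published statistics;
    stub_ratio is used as given, since its derivation is not stated in the text.
    """
    if articles <= 0:
        raise ValueError("articles must be positive")
    return (edits / articles) * (non_articles / articles) * stub_ratio
```

For instance, purely hypothetical inputs of 1,000 edits, 100 articles, 200 non-article pages and a stub ratio of 0.5 give 10 × 2 × 0.5 = 10.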

The depth of Malayalam Wikipedia is 172, and it ranks second among Wikipedias with more than 10,000 articles. Malayalam Wikipedia has 12,480 users in total, of whom 193 are active, whereas English Wikipedia has more than 3 million articles and 10 million users. This does not mean that every article in Malayalam Wikipedia is also present in English Wikipedia. The depth of English Wikipedia is the highest, and it is considered the most reliable among all Wikipedias. The table below compares different parameters of Malayalam and English Wikipedia.

No. / Language / Language (local) / Wiki / Articles / Users / Active Users / Depth
1 / English / English / en / 3 012 031 / 10 382 238 / 146 086 / 445
86 / Malayalam / മലയാളം / ml / 10 756 / 12 480 / 193 / 172

Table 2.0 Comparison between English and Malayalam Wikipedia

2.5 Multilingual Wikipedia and NLP

Most of the entries in Wikipedia have descriptions in multiple languages. The information across these multilingual Wikipedias varies in size because the entries are created independently by different users. The regional relevance of the subject accounts for most of the difference in information. “The rapid globalization of Wikipedia is generating a parallel, multi-lingual corpus of unprecedented scale” (Adar, Skinner, Weld, 2009). Recent developments in Natural Language Processing have taken advantage of these differences to create multilingual corpora. There are mainly two types of corpora available:

  • Parallel multilingual corpus
  • Comparable multilingual corpus

Parallel multilingual corpora: “Parallel corpora contain the same information translated from one language into a set of pre-specified languages with the goal of preserving the information covered in the source language document” (Filatova, 2009). Parallel corpora are used by most machine translation approaches. Parallel corpora, also known as bilingual files, are scanned to produce an appropriate translated segment.

Comparable multilingual corpora: “A comparable corpus is defined as a set of documents in one to many languages that are comparable in content and form in various degrees and dimensions” (Filatova, 2009). The NLP community is studying the usefulness of comparable corpora for machine translation.

Adar, Skinner and Weld (2009) have developed a system known as “Ziggurat” to analyze the information in the infoboxes across multilingual Wikipedia. This system is very helpful in creating new infoboxes as necessary, filling in missing information, and detecting discrepancies between parallel pages.

2.6 Machine Translation

Machine Translation is the computer-aided translation of natural language from one language to another. A machine translation system can be thought of as a compiler, in which the information in one language is parsed and checked for syntactic correctness. Once the syntactic analysis is over, the source is broken down into small segments, which are checked against a multilingual dictionary or a database that holds similar segments and their target-language equivalents. If a segment is available, it is replaced in the intermediate file to generate a translated target file. Machine Translation techniques could be used in the future to unify the information across multilingual Wikipedia. Likewise, machine translation systems could use the multilingual information in Wikipedia to improve their translation efficiency. The following are some of the common approaches used in machine translation.
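The segment lookup step described above can be sketched in Python as follows. This is only a toy illustration under stated assumptions: the syntactic analysis is skipped, segments are matched greedily (longest first) against a small in-memory table, and the function name `translate_segments` and the table entries are hypothetical and not taken from any real system.

```python
def translate_segments(tokens, phrase_table, max_len=3):
    """Replace the longest known source segments with their target equivalents.

    Tokens without an entry in phrase_table are copied through unchanged.
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = " ".join(tokens[i:i + n])
            if segment in phrase_table:
                output.append(phrase_table[segment])
                i += n
                break
        else:
            output.append(tokens[i])  # no entry found: keep the source token
            i += 1
    return output


# Illustrative Malayalam-to-English entries (assumed, not from a real dictionary).
example_table = {"നല്ല പുസ്തകം": "good book", "വീട്": "house"}
example_output = translate_segments("നല്ല പുസ്തകം വീട്".split(), example_table)
```

Matching the longest segment first lets a multi-word entry take precedence over its individual words, which is one simple way to read "small segments" in the description above.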

Rule based Machine Translation: Rule-based machine translation (RBMT) relies on built-in linguistic rules and bilingual dictionaries. The source text is parsed according to the stored rules and translated into the target language with the help of bilingual dictionaries.

Statistical Machine Translation: “Statistical Machine Translation (SMT) is an approach to Machine Translation that is characterized by the use of machine learning methods” (Lopez, 2008). Statistical Machine Translation relies on a multilingual corpus for translation. A large amount of data needs to be stored throughout the lifetime of the software. The program scans through this huge database to generate an output. The outputs generated by Statistical Machine Translation tools are quite readable, since the system is not just replacing words from a multilingual dictionary.

Example based Machine Translation: “Example based Machine Translation (EBMT) is an approach which retrieves similar examples (Pairs of source phrases, sentences, or texts and their translations) from a database of examples, adapting the examples to translate a new input” (Sumita, Iida, 1991). This is an enhanced version of Statistical Machine Translation in which pairs of source phrases, or even the entire document, are replaced from the examples stored in the database.

2.7 Machine Translation Evaluation Methods

There are various means of evaluating the effectiveness of machine translation systems. Assessment of a translation's quality by human judges is the oldest method and the best. Even though human evaluation is time-consuming, it is still the most reliable way to compare the translation quality of documents. Automated modes of evaluation include BLEU, NIST and METEOR.

BLEU (Bilingual Evaluation Understudy): BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human. BLEU was one of the first metrics to achieve a high correlation with human judgments of quality, and remains one of the most popular. BLEU uses the average logarithm with uniform weights, which is equivalent to using the geometric mean of the modified n-gram precisions.
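The geometric mean of modified n-gram precisions can be sketched in Python as below. This is a simplified illustration rather than a reference implementation: it works on pre-tokenized word lists, applies no smoothing, and also includes the brevity penalty of the published metric, which the description above does not mention. The function names `ngrams`, `modified_precision` and `bleu` are chosen for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as
    often as it occurs in the single reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of modified precisions (uniform weights)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_average = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference length closest to the candidate's.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_average)
```

Averaging the logarithms with uniform weights and exponentiating the result is what makes the score a geometric mean, so a single poor n-gram precision pulls the whole score down sharply.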