An alternative approach to produce indicators of languages in the Internet
Daniel Pimienta
Observatory of languages and culture in the Internet
http://funredes.org/lc
World Network for Linguistic Diversity
http://maaya.org
ABSTRACT
Given the difficulty of obtaining reliable data about the space of languages in the Internet, an alternative approach is presented to compute indicators of language behavior in the Internet for the 140 languages with more than 5 million speakers. The approach is based on the collection of a series of micro-indicators that measure languages or countries in various Internet spaces or applications. A weighting method transforms world percentages by country into world percentages by language. Six indicators of languages in the Internet are defined (Internet users, traffic, usage, contents, societal indexes and interfaces), and four macro-indicators are deduced for each language: power, capacity, gradient and content productivity. The values of these indicators are calculated by processing a set of 369 micro-indicators. The article shows by examples how strongly biases have marked the history of linguistic cybermetrics. Accordingly, all possible biases derived from the method, assumptions or sources are discussed, and finally an estimate is proposed that takes those biases into account. Recommendations are made for linguistic cybermetrics to come back to where it belongs: academia and research. Credit belongs to the International Organization of Francophonie (OIF), which funded a series of studies on the place of French on the Internet that allowed the development of this method, and to Daniel Prado, who first conceived the idea of collecting diverse sources to measure languages in the Internet and of transforming figures per country into figures per language. All the experiments conducted with the resulting benchmark demonstrate its ability to serve as a modeling and simulation tool and to be used for prospective scenarios. Critical comparisons are made with the rare existing indicators (W3Techs and InternetWorldStats).
Contents
1. Introduction
2. Methodology
2.1 Language's indicators in the Internet
2.2 Micro-indicators by language or country
2.2.1 By language
2.2.2 By country
2.3 Sources of information
2.4 Demo-linguistic data
2.4.1 The case of mother tongue (L1)
2.4.2 The case of second language (L2)
2.5 Process
2.5.1 Extrapolation
2.5.2 Process for indicators
2.5.3 Process to achieve differentiated results by theme
2.5.4 Types of weightings used
3. Results
4. Results analysis
4.1 About Wikipedia
4.2 Comparison of results with those of InternetWorldStats (IWS)
4.3 Sensitivity factors
5. Methodological limitations, bias analysis and controls carried
5.1 Biases is all!
5.1 Limitations/biases proper of the method
5.2 Languages
5.2.1 Selection of the source for the calculation of L1
5.2.2 The case of L2
5.2.3 Reducing the number of languages
5.2.4 Checking the invariance of results to the number of languages
5.3 Countries
5.4 Sources
5.4.1 Basics
5.4.2 Exceptions to the basic principles
5.4.3 The question of dates
5.4.4 The question of the meaning of the country > language transformation
5.4.5 Limitations due to sources
5.4.6 Potential bias from Alexa.com and W3Techs
5.4.7 Fixing the biases of W3Techs
5.4.8 Limitations / bias related to the degree of locality sources
5.4.9 About the principle of weighting
6. Conclusions and perspectives
7. Bibliography
Appendix I. List of micro indicators
Appendix II. selected sources
Appendix III: Values chosen for L2
Annex IV: Strongly local sites
LIST OF TABLES AND FIGURES
TABLES
TABLE 1: Description of indicators
TABLE 2: Description of macro-indicators
TABLE 3 : The 3 types of weightings used.
TABLE 4 : Indicators for the top 15 languages in terms of power
TABLE 5 : Languages sorted by capacity
TABLE 6 : Languages sorted by percentage of people connected
TABLE 7 : Languages sorted by gradient
TABLE 8 : Ratios Wikipedia articles / Internet
TABLE 9 : Differences with IWS in assumptions for L1+L2
TABLE 10 : InternetWorldStats figures (June 2016)
TABLE 11: Comparison IWS vs. Results of the study reduced to 100%
TABLE 12 : Simulation with data from IWS
TABLE 13 : Languages distribution in the Web as stated by Inktomi (2000)
TABLE 13 : Connecting Country
TABLE 14 : Ratios traffic / subscribers
TABLE 15 : Speculative ranking
TABLE 16 : Distribution of language locality websites
TABLE 17 : Simulation for interfaces
TABLE 18 : Simulation for index
FIGURES
FIGURE 1 : Languages on the Internet Indicators
1. Introduction
During the period 1998-2007, the author and Daniel Prado collaborated from their respective institutions, Networks & Development Foundation (FUNREDES) and Union Latine, on the design of methods for measuring languages in the Internet that could provide reproducible and reliable indicators (see [6]). At the same time, other initiatives[1] existed with the same objectives. From 2007, changes in the size of the Web and in the behavior of search engines rendered those methods obsolete and created a vacuum in the production of indicators of languages in the Internet[2]. Between 2010 and 2012, under the leadership of the author, FUNREDES and MAAYA proposed an ambitious research project aimed at filling that void. UNESCO, OIF and the Union Latine joined this project, and a consortium of prestigious European research institutions was formed to seek funding from the EU Research Framework Program (project DILINET[3]). However, this effort did not secure funding, despite the persistence and quality of the research teams involved (two attempts were made in 2012 and 2013, in response to European calls[4], and a final attempt with Qatari partners was aborted in 2014).
To fill that void, Daniel Prado proposed in 2012 a new method, more pragmatic and much less expensive though less ambitious, based on the observation of language behavior in a wide variety of spaces and applications of the Internet. This opened a new collaboration with the author, under the MAAYA institutional hat and with the support of OIF. Two early studies provided results in terms of rankings for French in the Internet. The second, conducted in 2013, fed the Internet chapter of the 2014 report "Le français dans le monde" (see [3]) and was followed by a similar study of Spanish in the Internet (see [1]). The latest OIF study, more ambitious, which inspires this article, applied a statistical approach, made possible by the increased number of sources, to produce language indicators in the Internet for a wide range of languages.
The method is based on collecting quantitative information about language use in many Internet applications and spaces. Compiling and organizing the sources makes it possible to measure the presence of languages in the Internet and to put the results into perspective by building a series of indicators that measure the share of each language in the Internet in terms of Internet users, traffic generated, use of services, content, and interfaces to software and translation languages, together with a series of indexes that evaluate societal criteria related to the Information Society. A summary of the presence of languages in the Internet is produced in the form of a series of macro-indicators, which combine all parameters to compute, for each language:
· its power in the Internet,
· the capacity of its speakers,
· the gradient influence of its connected speakers,
· the productivity of its connected speakers in terms of content production.
The thematic groupings of micro-indicators can further differentiate the potential of a given language with regard to these themes and thus highlight the strengths and weaknesses of this language in the Internet[5].
The methodological framework is to use as many sources as are available to quantify the role of languages in the Internet, either directly when figures concerning the language are available, which is unfortunately rare, or indirectly, using figures per country and transforming them into figures per language[6].
This transformation of country-related figures into language-related figures makes the method an original, unprecedented approach and enables it to address the question of languages in the Internet in a context where language indicators are usually nonexistent and, at best, highly unreliable.
This approach rests on implicit assumptions that need to be made explicit and evaluated, and a number of precautions must be taken to ensure consistency and reliability. The method involves some complexity, both in its calculations and in the understanding of the resulting concepts. The discussion of its limits, and of the controls carried out to ensure its reliability, therefore occupies a significant part of this article.
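The country-to-language transformation mentioned above can be illustrated with a minimal sketch. It assumes, for illustration only, that each country's world share is split among languages in proportion to the number of connected speakers of each language in that country; the function name, the data layout and the figures are hypothetical and are not taken from the study.

```python
def country_to_language(country_shares, connected_speakers):
    """Turn world percentages by country into world percentages by language.

    country_shares: country -> world share (%) for one micro-indicator.
    connected_speakers: country -> {language: connected speakers in that country}.
    Assumption (not from the study): a country's share is divided among its
    languages in proportion to their connected speakers.
    """
    language_shares = {}
    for country, share in country_shares.items():
        speakers = connected_speakers[country]
        total = sum(speakers.values())
        for language, count in speakers.items():
            portion = share * count / total
            language_shares[language] = language_shares.get(language, 0.0) + portion
    return language_shares


# Invented figures: two countries, two languages.
shares = country_to_language(
    {"X": 60.0, "Y": 40.0},
    {"X": {"en": 80, "es": 20}, "Y": {"es": 100}},
)
```

In this toy case the shares by language still sum to 100%, because every country's share is fully redistributed among its languages.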
2. Methodology
2.1 Language's indicators in the Internet
Practices or uses in the Internet relate to applications (e.g. Google search engine or Facebook social network) or spaces (e.g. smart phones or e-government). When reliable quantitative sources can be identified, micro-indicators of the place of languages in the Internet will be defined in relation to the application or the space. Statistical methods (different types of weighted means, truncated means or simple means) are then deployed to construct indicators. Six indicators were thus identified that measure the global share[7] of each language according to the characteristic elements of the Internet:
· Internet users (persons connected to the Internet): relates to the speakers of each language who have access to the Internet. A single micro-indicator (provided by ITU) meets that need and serves as a fundamental source for the rest of the work.
· Usage: relates to subscriptions to applications or to means of connection to the Internet. Eleven micro-indicators are involved in the construction of this indicator.
· Traffic: indicates the traffic generated by users to applications. Three hundred and sixteen micro-indicators are used to construct this indicator.
· Indexes: relates to country rankings in various aspects of the information society. Five micro-indicators are currently used to construct this indicator.
· Contents: relates to contents on the Web for each language; for the moment, it mainly gathers data from the Wikimedia galaxy. Thirteen micro-indicators provide data for this indicator.
· Interfaces and translation languages: refers to the presence of languages in interfaces to applications or among translation languages. Twenty-three micro-indicators build this indicator.
Finally, four macro-indicators of the presence of languages in the Internet synthesize all of the above:
· a macro-indicator of the power of the language in the Internet, which measures the global share of the language in the Internet as the average of the six previous indicators;
· a macro-indicator of the capacity of the language in the Internet, measured by the ratio between the power and the percentage of the world population speaking that language;
· a macro-indicator of the gradient of the language in the Internet, measured by the ratio between the power and the percentage of people connected to the Internet;
· a macro-indicator of the productivity of the language in the Internet in terms of content creation, measured by the ratio between the percentage of content in that language and the percentage of Internet users in the same language.
The power indicator is expressed as a world share. The other three macro-indicators are dimensionless ratios for which 1 is the neutral value: the further the value is above (below) 1, the stronger (weaker) the corresponding property of the language.
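The definitions of the four macro-indicators can be summarized in a short sketch. It assumes that the "percentage of people connected to the Internet" used for the gradient is the language's world share of Internet users (the Internet users indicator); the function name, the dictionary keys and the sample values are hypothetical.

```python
from statistics import mean

# The six indicators, each expressed as a world share (%) for one language.
INDICATORS = ("users", "usage", "traffic", "indexes", "contents", "interfaces")


def macro_indicators(shares, population_share):
    """Compute power, capacity, gradient and productivity for one language.

    shares: indicator name -> world share (%) of the language.
    population_share: share (%) of the world population speaking the language.
    Assumption: the gradient divides the power by the "users" share.
    """
    power = mean(shares[name] for name in INDICATORS)
    return {
        "power": power,
        "capacity": power / population_share,
        "gradient": power / shares["users"],
        "productivity": shares["contents"] / shares["users"],
    }


# Invented values for a single language.
example = {"users": 4.0, "usage": 5.0, "traffic": 6.0,
           "indexes": 5.0, "contents": 7.0, "interfaces": 3.0}
result = macro_indicators(example, population_share=2.5)
```

With these invented values the power is 5% of the world total, and the three ratios are all above 1, which would indicate a language that is over-represented in the Internet relative to its speakers and its connected speakers.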
The statistical processing relies mainly on the one source regarded as both reliable and essential: the ITU data, which gives, for each country, the percentage of people connected to the Internet. As the rest of the article shows, this data is involved in a large number of the operations performed, in particular the weightings.
The following diagram shows all the indicators which are processed for each language.
FIGURE 1 : Languages on the Internet Indicators
All the micro-indicators are presented in Appendix I. The indicators and macro-indicators are detailed in a table in section 4.5.3 Treatments for results.
2.2 Micro-indicators by language or country
A set of 369 micro-indicators constitutes the data source for this edition of the study. Some (36) bear directly on languages in the Internet; the others (333) bear on countries.
2.2.1 By language
The micro-indicators by language relate to contents (13) and interfaces (23), for a total of 36 in this edition.
The sources of content are expressed in terms of units per language (e.g. number of Wikipedia articles per language) or world percentage (e.g. the world percentage per language of books sold by Amazon[8]). In the first case, the value is converted into a world percentage by dividing it by the total over all languages. The content indicator is calculated as the truncated mean at 20%[9] of the 13 micro-indicators for content: one for the number of books per language in Amazon, one from W3Techs[10] and 11 related to Wikimedia.
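The two operations described above, conversion of counts into world percentages and truncated mean, can be sketched as follows. The sketch assumes that "truncated at 20%" means discarding the lowest 20% and the highest 20% of the values (rounded down) before averaging; the study's own definition may differ. The function names and sample values are hypothetical.

```python
def to_world_percentages(counts):
    """Convert units per language (e.g. article counts) into world percentages."""
    total = sum(counts.values())
    return {language: 100.0 * n / total for language, n in counts.items()}


def truncated_mean(values, proportion=0.20):
    """Mean of the values left after trimming each end of the sorted list.

    Assumption: `proportion` of the values is cut from EACH end, rounded down.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion too large: no values left to average")
    return sum(kept) / len(kept)


# Invented data: counts for two languages, then 13 micro-indicator values
# (world percentages) for one language, with one low and one high outlier.
percentages = to_world_percentages({"a": 300, "b": 100})
values = [0.5, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 9.0]
content_indicator = truncated_mean(values)
```

With 13 values and a 20% cut, two values are dropped at each end under this assumption, so the outliers 0.5 and 9.0 no longer influence the result.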
The interfaces and translation languages are recorded as a binary value expressing the absence (0) or existence (1) of an interface to the application in the language, or of the language among the translation languages. There are 23 applications covered. The interface indicator first gives the percentage of applications in which the language is present; this measured value is then transformed into a world percentage by a weighting relative to the percentage of Internet users in each language.
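A minimal sketch of the interface indicator follows. It assumes that the weighting multiplies each language's coverage of the applications by its share of Internet users and then renormalizes so that all languages sum to 100%; this is one plausible reading of the weighting, not a confirmed description of the study's computation. Names and figures are hypothetical.

```python
def interface_world_shares(presence, user_shares):
    """Turn 0/1 presence flags into world percentages by language.

    presence: language -> list of 0/1 flags, one per application.
    user_shares: language -> world share (%) of Internet users.
    Assumption: world share is proportional to coverage x user share.
    """
    coverage = {lang: sum(flags) / len(flags) for lang, flags in presence.items()}
    weighted = {lang: coverage[lang] * user_shares[lang] for lang in coverage}
    total = sum(weighted.values())
    return {lang: 100.0 * w / total for lang, w in weighted.items()}


# Invented data: 4 applications instead of the 23 used in the study.
interfaces = interface_world_shares(
    {"A": [1, 1, 1, 1], "B": [1, 0, 1, 0]},
    {"A": 30.0, "B": 20.0},
)
```

Here language B is present in half of the applications, so its weight is halved before renormalization, which moves part of its share to language A.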
2.2.2 By country
Over 90% of the micro-indicators in this edition of the study were obtained from sources that provide information by country, which can be reported in several ways:
· in amounts (e.g. the number of Internet users per country);
· in national percentages (e.g. the percentage of consultation of Facebook by country, or the percentage of mobile Internet traffic carried in each country);
· in world percentages (e.g. the distribution by country of world traffic to the site Facebook.com);
· as ratings on a predetermined scale (in the case of indexes that rate countries according to criteria, such as WebIndex[11], which provides a country index about the information society taking values from 0 to 1 or 0 to 100, as applicable);