My Webpage can Speak Many Languages
Tomasz Müldner, Frank Wong and Darcy Benoit
Jodrey School of Computer Science, Acadia University
Wolfville, NS, Canada B4P 2R6
[My comments are written in these brackets]
[Note: My intention is that this document will be published as a Technical Report, and so its length is not a problem, and it does not have to be ready for the Edmedia deadline; instead say early January. We’ll use it as a working documentation of the project, for submitting a paper – before dec. 19. - to Edmedia (using only 8 pages from this text), and possibly for other future publications. To help you find more immediate tasks, in the text there are comments, such as [Frank EDM], which indicates that Frank should write it for the Edmedia’s deadline]
[to be written]
In the past, most software programs could only “speak one language”. For example, software developed in the UK could speak English, while the same software developed in China could only speak Chinese. (In this paper, by a language we mean the natural language used in the Human-Computer Interface (HCI). Unless it is clear from the context, we always say “a programming language” when we refer to a language used for programming.) Therefore, two or more versions of the same program that differ in the HCI language might require completely different implementations, resulting in error-prone duplication of the original code. This situation was unacceptable for two reasons. Firstly, there are many multi-language countries, such as Canada, where English and French are the two official languages. Secondly, with growing globalization, products are often developed in one country and shipped to several other countries. This forced software developers to rethink the software development process and tackle the issue of internationalization. Internationalization of a product means that the product can be adapted to various languages without making any changes to its architecture. Localization of an internationalized product refers to the adaptation of this product to a specific locale, which describes the language. For example, an internationalized Java calendar can be localized to French, and then it will show dates in the format used in France. Since these terms are rather long, short mnemonic terms are commonly used: I18N for internationalization and L10N for localization (respectively, for the 18 and 10 letters between the first and the last letter).
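The calendar example above can be illustrated with a minimal sketch that formats one date under two locales. It uses the standard java.text.DateFormat class; the class name LocalizedDateDemo, the helper names and the sample date are our own assumptions, chosen only for illustration.

```java
import java.text.DateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocalizedDateDemo {
    // Hypothetical sample date (19 March 2003), used only for illustration.
    static Date sampleDate() {
        return new GregorianCalendar(2003, Calendar.MARCH, 19).getTime();
    }

    // The formatting code is the same for every locale; only the locale argument changes.
    static String format(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        Date date = sampleDate();
        System.out.println(format(date, Locale.FRANCE)); // date in the French format
        System.out.println(format(date, Locale.US));     // date in the US English format
    }
}
```

The point of the sketch is that localization is reduced to passing a different Locale object; no part of the program logic is rewritten.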
Internationalization is difficult because it goes beyond the simple functional suitability of a program and considers the many facets of the HCI language used for display. Various languages use different alphabets and scripts, spacing rules, directions, date and currency formats, sort orders, etc. Most programs have a Graphical User Interface (GUI) that is built with standard widgets, such as menus and dialog boxes. Localized versions of internationalized programs not only have to provide appropriate translations of interface menus and prompts, but the standard widgets must also be able to handle all the different HCI languages properly, including issues such as word length, word positioning, and font differences. Therefore, the implementation of the GUI can no longer be hard coded in the source code of the program; instead, it must be parameterized so that versions for different languages can be plugged in without modifying the architecture of the program.
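One possible way to parameterize GUI text is sketched below with the standard java.util.ListResourceBundle class. This is only an illustration under our own assumptions: the class names, the keys (dialog.title, button.cancel) and the label texts are hypothetical, and a real program would more likely let ResourceBundle.getBundle select the bundle for a locale.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class GuiLabelsDemo {
    // English labels (hypothetical keys and texts).
    static class LabelsEn extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"dialog.title", "Open File"},
                {"button.cancel", "Cancel"}
            };
        }
    }

    // French labels for the same keys.
    static class LabelsFr extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"dialog.title", "Ouvrir un fichier"},
                {"button.cancel", "Annuler"}
            };
        }
    }

    // Picks the label set for a language code; unknown codes fall back to English.
    static ResourceBundle labelsFor(String language) {
        return "fr".equals(language) ? new LabelsFr() : new LabelsEn();
    }

    public static void main(String[] args) {
        // The GUI code refers to labels by key and never contains the text itself.
        ResourceBundle labels = labelsFor("fr");
        System.out.println(labels.getString("dialog.title"));
        System.out.println(labels.getString("button.cancel"));
    }
}
```

Adding a language then means adding one more bundle class (or resource file), while the code that builds the widgets stays unchanged.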
Interest in internationalization is growing rapidly, and more and more applications are being internationalized. One example is the Hotel Reservation System, see HRS (2003), where the user can choose one of 25 available languages. Many companies, such as EXCEL Translations (2003), specialize in internationalizing existing applications. At the same time, several programming languages and APIs provide support for internationalization, for example the Java JDK 1.4, see Java (2001), JSPs, see Seshadi (2003), and NetBeans from SUN, see NetBeans (2003). However, there are few internationalized personal web pages or educational applications, with notable exceptions such as the Mozilla, see Mozilla (1998), and Opera, see Opera (2003), web browsers. Both browsers have a core binary that functions by loading a separate file containing the information needed for a localized interface. They each provide language files for over 20 different languages, allowing for an easy switch of the interface language.
“…The core Mozilla binary executable for each platform supports computing in North American English, Western European, Central European, Chinese, Japanese and Korean locales. The user interface is contained in resource files and is, for the most part, completely separate from the core binary….” see Mozilla (1998).
Internationalization can be applied to existing software, or to software presently under development. In this paper, we describe the internationalization process applied to the development of a website. As a specific example of this general process, we describe the internationalization process using a system called Internationalized Faculty Website (IFW). This system can be used to create an internationalized website, which shows the Curriculum Vitae (CV) of a faculty member. The design of IFW uses Separation of Concerns (SOC) to separate tasks that require different types of technical expertise. A user of IFW (the creator of the webpage) does not need any technical background and need only follow a series of GUIs in a host language such as English. The English-speaking creator enters her or his CV data, selects one or more languages in which the website can be displayed, and then forwards the project to the IFW administrator. The administrator is responsible for finding translators for the task, verifying the completed translations, and making the final product available to the creator. The final product is a website that “can speak many languages”, i.e. a website that will initially be displayed using one default language, for example Polish, and will allow the user of the website to switch to any of the languages selected by the creator, for example Chinese, English, German and French.
In addition, the creator can selectively choose which data will be shown (for example, only the journal publications), and select the format in which these data are rendered (such as HTML or PDF). All translations are permanently stored and can be reused in future translation tasks. The implementation of IFW uses various recently developed software tools based on Java and XML, such as JAXB, versioning of XML documents, XML databases, and relational databases with XML support.
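To make the idea of selecting data and a display language more concrete, the following sketch models CV entries that carry their text in several languages. It is a hypothetical model written for this explanation and not the IFW implementation; the class and method names, the sample entries, and the fallback to a default language when a translation is missing are all our assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CvViewDemo {
    // One CV item with its text stored per language code (hypothetical model).
    static class CvEntry {
        final String category;
        final Map<String, String> textByLanguage = new LinkedHashMap<>();

        CvEntry(String category) {
            this.category = category;
        }

        CvEntry with(String language, String text) {
            textByLanguage.put(language, text);
            return this;
        }
    }

    // Returns the texts of all entries in the chosen category, in the requested
    // language, falling back to the default language when no translation exists.
    static List<String> select(List<CvEntry> cv, String category,
                               String language, String defaultLanguage) {
        List<String> shown = new ArrayList<>();
        for (CvEntry entry : cv) {
            if (!entry.category.equals(category)) {
                continue;
            }
            String text = entry.textByLanguage.get(language);
            if (text == null) {
                text = entry.textByLanguage.get(defaultLanguage);
            }
            if (text != null) {
                shown.add(text);
            }
        }
        return shown;
    }

    // Made-up sample data, for illustration only.
    static List<CvEntry> sampleCv() {
        List<CvEntry> cv = new ArrayList<>();
        cv.add(new CvEntry("journal").with("en", "Journal paper A").with("fr", "Article de revue A"));
        cv.add(new CvEntry("journal").with("en", "Journal paper B"));
        cv.add(new CvEntry("conference").with("en", "Conference paper C"));
        return cv;
    }

    public static void main(String[] args) {
        // Show only the journal publications, in French where available.
        System.out.println(select(sampleCv(), "journal", "fr", "en"));
    }
}
```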
This paper is organized as follows. Section 2 briefly describes related work, and Section 3 describes the functionality and implementation of the Internationalized Faculty Website. Finally, Section 4 provides conclusions and describes future work.
2. Related Work
In this section, we describe major issues related to the internationalization process and the translation process, and the support for internationalization provided by XML and Java.
2.1 Major Issues
Internationalization of a system requires the identification of all data that can be shown to the user and may have different values under different locales. These data, known as resources, include user messages, page headers and trailers, button labels, etc. In addition, there may be many specific formatting problems. For example, translated strings can vary significantly in size. Consider the example borrowed from Raetzmann & de Young (2003), in which the English text “Authorized User List” consists of 20 characters, while the corresponding German text, “Liste der berechtigten Benutzer”, consists of 31 characters. General formulas exist for calculating the space required for string expansion; for example, space for 15 characters should be reserved for each English word of one to six characters, while for longer English words approximately twice the number of characters should be reserved for the translated text. Text size can cause other problems, such as the limited space available for a box label: a label that has grown too long cannot be placed to the left of the box, and should be placed above the box instead. Standard formatting issues, such as date and currency formats, must also be considered. For a description of software that provides some tools to support this kind of formatting, see Section 2.3.
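The expansion rule quoted above can be written down as a small function. This is a sketch of one reading of the rule, applied to a single English word; the class and method names are hypothetical, and treating an empty string as needing no space is our assumption.

```java
class ExpansionRule {
    // Space (in characters) to reserve for the translation of one English word:
    // 15 characters for words of 1 to 6 characters, otherwise about twice the length.
    static int reservedWidth(String englishWord) {
        int length = englishWord.length();
        if (length == 0) {
            return 0; // assumption: nothing to translate, nothing to reserve
        }
        return length <= 6 ? 15 : 2 * length;
    }

    public static void main(String[] args) {
        System.out.println(reservedWidth("Help"));       // short word: fixed reserve
        System.out.println(reservedWidth("Authorized")); // long word: twice its length
    }
}
```

A layout routine could compare this reserve with the space actually available next to a widget and, if it does not fit, move the label above the widget.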
2.2 Translation Process
At the time of writing this paper, Google [add ref] and other sites provide English translations of small text fragments or entire Web pages from French, German, Italian, Spanish, and Portuguese. A user may also choose to set the Google homepage to one of more than 100 interface languages. Microsoft Word 2002 [add ref] is able to perform automatic translation between Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. Using a free upgrade from WorldLingo [add ref], it is possible to translate between many other languages. Figure 1 shows several Microsoft Word translations of the English sentence “The interest for internationalization is growing rapidly.” We also show a translation back to English for each translated sentence. While not perfect, these translations can help the translator to perform the required task. Repeated translations of the same string should be avoided in order to make the translation process more efficient: translations of common phrases such as “Press any key to continue”, which may appear many times in a single program, should be stored for future reuse. It is expected that as translation systems become more accurate, we will be able to automate much of the translation process.
L'intérêt pour l'internationalisation se développe rapidement
The interest for internationalization develops quickly
Das Interesse für Internationalisierung wächst schnell.
The interest in internationalization grows fast.
The interest rapidly appears for the internationalization
Figure 1: Microsoft Word translations
Computer-assisted translation uses so-called Translation Memory (TM) systems, which typically consist of a translator module, an editor module, and a database of terms. TM software stores language segments translated by the translator in a database for future reuse. Translators working on text segments can invoke fuzzy searches for these segments in the database, and use the results retrieved from the database. Some well-known TM systems are Déjà Vu (2003), the Translator's Workbench from Trados (2003), and STAR Transit (2003). Nowadays, TM software typically uses the Translation Memory eXchange (TMX) format, which is a standardized XML document type for storing collections of segments in multiple languages. For more information on TMX, see Lisa (2003).
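The store-and-reuse behaviour of a TM system, including a fuzzy search, can be sketched as follows. This is a toy illustration and not how any of the products named above work: the class name, the use of edit (Levenshtein) distance as the similarity measure, the distance threshold and the sample French translation are all our assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TranslationMemoryDemo {
    // Source segment -> stored translation (one target language, for simplicity).
    private final Map<String, String> segments = new LinkedHashMap<>();

    void store(String source, String translation) {
        segments.put(source, translation);
    }

    // Edit distance computed with the usual two-row dynamic programming table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Fuzzy search: returns the translation of the closest stored segment whose
    // distance to the query is at most maxDistance, or null if there is none.
    String lookup(String source, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, String> entry : segments.entrySet()) {
            int distance = editDistance(source, entry.getKey());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TranslationMemoryDemo tm = new TranslationMemoryDemo();
        // Illustrative translation entered once by a translator.
        tm.store("Press any key to continue", "Appuyez sur une touche pour continuer");
        // A slightly different segment still finds the stored translation.
        System.out.println(tm.lookup("Press any key to continue.", 3));
        // An unrelated segment finds nothing and must be translated by hand.
        System.out.println(tm.lookup("Cancel", 3));
    }
}
```

A fuzzy match is only a suggestion; in this sketch, as in the process described in the text, the translator is still expected to check the retrieved translation before using it.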
Using TM software for translations has both advantages and drawbacks. Firstly, TM software views the source text as a collection of text units called segments. Segments may range in size from simple text strings to paragraphs. The technique used to break up the text into segments is called segmentation. Unfortunately, segmentation may remove the context in which a segment appeared, resulting in an incorrect translation. For example, Savourel (2001) shows that the English word “Help” translates to two different French words, depending on the context in which the word appears. The common solution to the context problem is to use a verification phase, in which the translator reads, verifies and possibly corrects the translation. Secondly, TM systems are quite expensive, both in terms of the price of the software and the need to hire specialized personnel able to use these systems. (For more on automatic translation, see Dennett (1995).)