Using XSL and XQL for efficient, customised access to dictionary information

Kevin Jansz, Department of Computer Science, University of Sydney, Australia

Sng, Wee Jim, School of Applied Science, Nanyang Technological University, Singapore

Nitin Indurkhya, School of Applied Science, Nanyang Technological University, Singapore

Christopher Manning, Departments of Computer Science and Linguistics, Stanford University, USA

Abstract

XML is highly suited to representing richly structured information such as dictionary content, and conversely this well-structured storage of the information enables innovative browsing interfaces. We demonstrated this previously with the development of Kirrkirr, a web-based application that allows users to interactively explore a Warlpiri (a Central Australian language) dictionary in XML format. Two key design issues are customised presentation and efficient access. The greater the level of customisation, the broader the range of users Kirrkirr can accommodate. Efficient access is important if the application is to scale up to larger, more complex dictionaries. In this paper, we discuss these two issues and describe the use of XSL and XQL to further enhance Kirrkirr. While Kirrkirr already provides a number of interfaces to the lexical information, the challenge of creating new features and providing even greater flexibility lies in allowing the user to access certain parts of the XML database without the overhead of greater memory and time usage. By enhancing the indexing techniques of the original Kirrkirr, users may use XSL to personalise the information they access. Emerging technologies such as XQL give the potential for efficient access not only to dictionary entries, but also to the fields within the entries. Performance evaluation suggests that the use of XSL and XQL has had a very significant impact on Kirrkirr. The results can easily be seen to apply to a broad range of similar applications.

1 Introduction

1.1 Computational Lexicography

A language is more than individual words with a definition. It is a vast network of associations between words and within and across the concepts represented by words. The work described here is part of a broader project that has the general aim of providing people with a better understanding of this conceptual map. In particular, traditional paper dictionaries offer very limited ways of making such networks of meaning visible, whereas, on a computer, there are, in theory, no such limitations on the way information can be displayed. However, while dictionaries on computers are now commonplace, there has been little attempt to utilise the potential of the new medium. Most existing electronic dictionaries, whether on the web or on CD-ROM, present a plain, search-oriented representation of the paper version. In contrast, our goal has been to build fun dictionary tools that are effective for browsing and incidental language learning. Minimally, they should be as effective for browsing as the process of flicking through the pages of a paper dictionary, but beyond that we aim to use the computer to provide new means of effective browsing. For instance, we can show words grouped by meaning rather than spelling.

1.2 Initial focus: Warlpiri

The initial focus of our project was Warlpiri, an Australian Aboriginal language spoken in the Tanami desert northwest of Alice Springs. There were a number of factors influencing this choice:

  • Rich lexical materials have been collected by linguists since the 1950s, resulting in one of the most comprehensive lexical databases for any Australian language (Laughren and Nash 1983)
  • There is a relatively large community of people who speak Warlpiri as their first language, some of whom have had the opportunity of bilingual schooling in Warlpiri and English, and who would be able to benefit from the existence of a suitable dictionary
  • Until now, results have not been produced in a format usable by the community (only as raw printouts)
  • Educational goals. Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second language learners) are not met
  • The low level of literacy in the region makes an e-dictionary potentially more useful than a paper edition, since its use is less dependent on good knowledge of spelling and alphabetical order (Corris et al. forthcoming)

2 Overview of Kirrkirr

Details of the first version of Kirrkirr, our Warlpiri dictionary browser, have been described before (Jansz 1998; Jansz, Manning and Indurkhya 1999a; Jansz, Manning and Indurkhya 1999b). The ideas in the design are quite general and applicable to other dictionaries as well. As an environment for the interactive exploration of dictionaries, Kirrkirr attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information.

Written in Java, it can either be run over the web (high bandwidth) or run locally (here Java’s main advantage is cross-platform support). As shown in Figure 1, it is currently made up of five main modules:

•Graph Layout: words and their various relationships with other words are drawn in the form of an animated network graph of nodes and links between them. The user is encouraged to move the words around, as the nodes seem to “float” on the screen. The network can be made as large as the user wants by progressively expanding the links around related words.

•Formatted Entries: the user is also able to read the large amount of information contained in the dictionary entries. Unlike the compact formatted entries of a paper dictionary, the entries are clearly laid out, with colour used where appropriate. The user can also click on a cross-referenced word and have the system jump to the entry for that word.

•Notes: it is important for a system that is to be used for learning to have a facility for something like ‘pencil notes in the margin’. Users can very easily jot down (i.e. type) notes for a specific word as they use the system. These notes are saved in a user profile that can be searched later on. There is even the option to move these notes around in separate windows, like post-it notes.

•Multimedia: a very rare feature for an e-dictionary is the ability to hear the words of the dictionary in order to understand their pronunciation. This feature, coupled with various pictures relating to the word being looked at, makes the system very user friendly.

•Advanced searching capabilities: alongside the many ‘fun’ aspects of the system is the ability to search the database seriously. Users can perform searches using regular expressions, approximate ‘sounds like’ spelling or just plain text. Because the XML dictionary is well marked up, any field in the dictionary can be searched.

Figure 1: Kirrkirr

As reflected in the system modules, this application is unlike any other e-dictionary as it was designed to cater for the needs of Warlpiri speakers with various levels of competence. Features such as the searching facility allow information to be accessed easily and quickly, while incorporation of animation and sounds makes the dictionary usable by speakers with little to no background with the language.

3 Efficient Access of XML content

3.1 XDI: An index for XML-based Dictionaries

Storing the lexical database in an XML-formatted file is an effective middle ground between the structure and built-in querying of a relational database and the flexibility and portability of a plain text document. The strengths of this approach were best appreciated in the development of the Kirrkirr dictionary browser.

Initial testing showed that if the program simply read in the entire XML file (about 10 MB of text) and stored it as parsed data structures in memory, memory usage was excessively high. A simple solution was to create an index file containing each word, information about its cross-references for the graphical display, and the file position of its entry in the XML file. As a result, only the index need reside in memory, and of the 9300 entries in the dictionary, only those requested by the user are read in and processed. The use of an index resulted in a significant performance improvement. This approach worked well with the XML parser being used (Microstar's Ælfred parser). Although the parser (like other XML parsers of which we are aware) was built to parse an entire file at a time, it was relatively easy to adapt the code to allow processing of just one entry. The use of an index file was also well suited to usage over the Web. Because the parser processes only those parts of the XML file that are required, when they are required, the system is very efficient and can be used even over a low bandwidth connection. Once whole entries are parsed, they are kept temporarily in a memory cache that speeds up subsequent accesses to the same entry in the browsing session. The XDI scheme is illustrated in Figure 2.
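The indexing idea can be illustrated with a minimal sketch. It is not the Kirrkirr implementation (which is written in Java and uses the Ælfred parser); it assumes a tiny in-memory stand-in for the dictionary file, placeholder headwords and glosses, a hypothetical GL element, and an index that records only byte offsets and lengths rather than cross-reference information.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature dictionary. DICTIONARY/ENTRY/HW follow the paper's
# hierarchy; the GL element, "wordB", "wordC" and the glosses are placeholders.
XML_BYTES = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>jaja</HW><GL>gloss one</GL></ENTRY>\n"
    b"<ENTRY><HW>wordB</HW><GL>gloss two</GL></ENTRY>\n"
    b"<ENTRY><HW>wordC</HW><GL>gloss three</GL></ENTRY>\n"
    b"</DICTIONARY>\n"
)

ENTRY_RE = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HW_RE = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)


def build_index(data):
    """Scan the file once, mapping each headword to (byte offset, length)."""
    index = {}
    for match in ENTRY_RE.finditer(data):
        headword = HW_RE.search(match.group(0)).group(1).decode("utf-8")
        index[headword] = (match.start(), match.end() - match.start())
    return index


def read_entry(stream, index, headword):
    """Seek to one entry and parse only that fragment of the file."""
    offset, length = index[headword]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))


index = build_index(XML_BYTES)
stream = io.BytesIO(XML_BYTES)  # stands in for the dictionary file on disk
entry = read_entry(stream, index, "wordB")
print(entry.findtext("HW"), "->", entry.findtext("GL"))
```

Only the small index has to stay in memory; each lookup reads and parses a single entry, which is the property the paragraph above relies on.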

Figure 2: XDI: Indexing the XML lexical database for better information access

The idea of using an index is well known from database systems. By using an index, Kirrkirr implicitly recognises the dictionary as a database (albeit a richly structured one), and XDI can be seen as a customised index for such a database. One might well argue that if one is to view the dictionary as a database, then perhaps it is better to use off-the-shelf database products that have their own customised index structures for efficient access. However, this involves a tradeoff against preserving the rich structure in the dictionary. By using XDI, Kirrkirr tries to strike a balance between the indexing capabilities of standard database systems and the expressiveness and portability of XML. The drawback of a customised solution is that system complexity goes up and one is precluded from taking advantage of alternative solutions to the general problem of XML information retrieval.

3.2 XQL: A standard query language for XML content

3.2.1 The Potential of XQL

XQL is a set of extensions to the Extensible Style Language (XSL) specification that allows developers using XML to easily execute powerful, complex queries on XML documents. Proposed to the W3C by representatives from Microsoft, Texcel and WebMethods in 1998, it competes with the SQL-oriented XML-QL whose specification was submitted to the W3C by AT&T Labs. However, we will not further discuss XML-QL here, but only the XQL API underlying the new data access model of Kirrkirr.

Traditionally, structured queries have been used primarily for relational or object-oriented databases, and documents were queried with relatively unstructured full-text queries. Although sophisticated query engines for structured documents have existed for some time, they have not been a mainstream application. XML documents are structured documents – they blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents containing very little structure beyond the use of headers and tables. Kirrkirr is somewhere in between: an e-dictionary has complex recursive structure, but also much relatively unstructured free text, and clearly needs effective query mechanisms for access.

Database developers have for decades taken for granted the ability to execute queries on data stores. However, because XML is a young data technology, its querying functionality has been very limited. XQL gives developers the querying functionality they have become used to in the database world, including the following:

•Functionality equivalent to the SQL SELECT statement

•Functionality equivalent to the SQL WHERE clause

•Boolean logic operators (e.g. AND, OR, NOT)

•Comparison operators (e.g. greater than, less than, less than or equal to)

•Wildcard operators (e.g. *)

The major differences between SQL and XQL are summarized in Table 1. It is clear that XQL is an invaluable tool with huge potential for accessing dictionary information stored in XML format.

SQL / XQL
The database is a set of tables. / The database is a set of one or more XML documents.
Queries are done in SQL, a query language that uses tables as a basic model. / Queries are done in XQL, a query language that uses the structure of XML as a basic model.
The FROM clause determines the tables which are examined by the query. / A query is given a set of input nodes from one or more documents, and examines those nodes and their descendants.
The result of a query is a table containing a set of rows. / The result of a query is a set of XML document nodes, which can be wrapped in a root node to create a well-formed XML document.

Table 1: Main differences between SQL and XQL
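The contrast in Table 1 can be made concrete with a small sketch. No XQL engine is assumed here: the limited XPath subset of Python's xml.etree.ElementTree stands in for XQL, sqlite3 stands in for a relational database, and the table name, the GL element, the RESULT wrapper and the sample data are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational view: the database is a table and the result is a set of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (hw TEXT, gloss TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("jaja", "gloss one"), ("wordB", "gloss two")],
)
rows = conn.execute(
    "SELECT hw, gloss FROM entry WHERE hw = ?", ("jaja",)
).fetchall()

# Document view: the database is an XML document and the result is a set
# of nodes selected from the tree by a path with a filter.
root = ET.fromstring(
    "<DICTIONARY>"
    "<ENTRY><HW>jaja</HW><GL>gloss one</GL></ENTRY>"
    "<ENTRY><HW>wordB</HW><GL>gloss two</GL></ENTRY>"
    "</DICTIONARY>"
)
nodes = root.findall("./ENTRY[HW='jaja']")

# Wrapping the selected nodes in a root node gives a well-formed document.
result = ET.Element("RESULT")
result.extend(nodes)
xml_out = ET.tostring(result, encoding="unicode")

print(rows)
print(xml_out)
```

The SQL query returns a list of row tuples, whereas the tree query returns element nodes that keep their internal structure and can be serialised again as XML.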

3.2.2 DOM (Document Object Model)

XQL implementations typically operate on a model of the XML document known as the DOM (Document Object Model). The DOM is a platform-independent, programming-language-neutral application programming interface (API) for HTML and XML documents. Its core outlines a family of types that represent all the objects that make up an XML document: elements, attributes, entity references, comments, textual data and processing instructions. With that, it defines the logical structure of documents and the way a document is accessed and manipulated. (DOM specifies how XML documents are represented as objects, so that they may be used in object oriented programs.)

Increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data to allow programs to access and modify the content and the structure of XML documents from within applications. Anything found in an XML document can be accessed, changed, deleted, or added using the DOM, except for the XML internal and external subsets for which DOM interfaces have not yet been provided.

After the XML document has been parsed into a collection of objects conforming to the DOM protocol, the object model can be manipulated in any way that makes sense. This mechanism is also known as the "random access" protocol, as any part of the data can be visited at any time. The DOM usually resides in memory (it is the output of the XML parser), but it can also be stored on disk (to save on the time needed to parse the XML repeatedly) as a Persistent DOM (PDOM). When an XML document is large and not likely to change much, as is the case for dictionaries, using its PDOM representation can significantly speed up XQL querying.
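As an illustration of this random-access style, the following sketch uses xml.dom.minidom from the Python standard library, which provides a minimal DOM implementation. It only shows in-memory DOM access; the sample document, the SRC attribute and the file name are hypothetical, and no persistent DOM is involved.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<DICTIONARY>"
    "<ENTRY><HW>jaja</HW></ENTRY>"
    "<ENTRY><HW>wordB</HW></ENTRY>"
    "</DICTIONARY>"
)

# Random access: go straight to the second ENTRY without visiting the first.
entries = doc.getElementsByTagName("ENTRY")
second = entries[1]
hw = second.getElementsByTagName("HW")[0]
print(hw.firstChild.data)

# Add: attach a new IMAGE element (attribute name and value are made up).
image = doc.createElement("IMAGE")
image.setAttribute("SRC", "placeholder.png")
second.appendChild(image)

# Change: edit the text node holding the headword.
hw.firstChild.data = "wordB2"

# Delete: remove the first ENTRY from the document element.
doc.documentElement.removeChild(entries[0])

print(doc.documentElement.toxml())
```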

3.2.3 Using XQL in Kirrkirr

The XQL search engine and the PDOM used for the new dictionary representation in Kirrkirr originated from a research project at GMD-IPSI [HREF8], the Institute for Integrated Publication and Information Systems of the German National Research Centre for Information Technology. The PDOM incorporates consolidated concepts from many years of leading edge research and development in the fields of federated databases and document management systems. Persistence is achieved by indexed, binary files. The XML document is parsed once and stored in binary form, accessible to DOM operations without the overhead of parsing it each time the information is required. The implementation uses a robust and efficient mix of indexing and query optimisation techniques. A cache architecture further boosts performance. This approach scales very well beyond the limitations of main memory. The PDOM is generated from the XML dictionary. Subsequently, Kirrkirr uses XQL to query the PDOM. Parsing of the XML need not be done repeatedly (it is only necessary when the dictionary changes) and access is faster.
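The "parse once, store in binary form" idea can be sketched as follows. This is only a rough analogy and not the GMD-IPSI PDOM: it assumes Python's pickle module as the binary format, a hypothetical file name, and placeholder data. Unlike the PDOM, which uses indexed files and a cache so that it scales beyond main memory, this sketch loads the whole stored tree at once.

```python
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET

XML_TEXT = (
    "<DICTIONARY>"
    "<ENTRY><HW>jaja</HW></ENTRY>"
    "<ENTRY><HW>wordB</HW></ENTRY>"
    "</DICTIONARY>"
)


def to_plain(element):
    """Convert a parsed element into nested (tag, text, children) tuples."""
    children = [to_plain(child) for child in element]
    return (element.tag, (element.text or "").strip(), children)


def build_store(xml_text, path):
    """Parse the XML once and write the parsed tree in binary form."""
    with open(path, "wb") as handle:
        pickle.dump(to_plain(ET.fromstring(xml_text)), handle)


def load_store(path):
    """Reload the stored tree; no XML parsing happens here."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def headwords(store):
    """Collect the text of every HW field in the stored tree."""
    _tag, _text, entries = store
    return [
        text
        for _etag, _etext, fields in entries
        for tag, text, _kids in fields
        if tag == "HW"
    ]


with tempfile.TemporaryDirectory() as tmp:
    store_path = os.path.join(tmp, "dictionary.bin")  # hypothetical file name
    build_store(XML_TEXT, store_path)   # needed only when the dictionary changes
    store = load_store(store_path)      # done at every start-up

print(headwords(store))
```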

The following is a simplified XML hierarchy of the dictionary.

Figure 3: A sample DOM tree

The dictionary is a sequence of many entries, which include some subset of a large number of dictionary components, including a headword (HW element) and perhaps one or more pictures (IMAGE element).

To find an entry whose headword (<HW>) is 'jaja', the following query ([ ] is the filter clause, equivalent to the WHERE clause in SQL) may be used:

/DICTIONARY/ENTRY[HW='jaja']

Alternatively, if the PDOM index is known, say index = 9 for the word jaja, we can use the query:

/DICTIONARY/ENTRY[9]

Both of the above queries execute very slowly, and the time taken depends very much on the number of <ENTRY> nodes in <DICTIONARY>. This is bad news even for the 9300 entries of the current Warlpiri dictionary, and would be totally impractical for something like a large English dictionary, which might have 100,000 or more headwords.
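The two queries, and the reason their cost grows with the number of entries, can be sketched with Python's xml.etree.ElementTree, whose XPath subset is used here as a stand-in for XQL. The generated headwords are placeholders, and ElementTree counts positions from 1, which may not match the indexing convention of the XQL engine. The scan function is an assumption about how an unindexed filter behaves, not a description of the GMD-IPSI engine.

```python
import xml.etree.ElementTree as ET


def make_dictionary(n):
    """Build a synthetic dictionary with n entries; 'jaja' is the ninth."""
    root = ET.Element("DICTIONARY")
    for i in range(1, n + 1):
        entry = ET.SubElement(root, "ENTRY")
        ET.SubElement(entry, "HW").text = "jaja" if i == 9 else f"word{i}"
    return root


root = make_dictionary(9300)

# Filter query, analogous to /DICTIONARY/ENTRY[HW='jaja'].
by_headword = root.find("./ENTRY[HW='jaja']")

# Positional query, analogous to /DICTIONARY/ENTRY[9].
by_position = root.find("./ENTRY[9]")


def scan(root, headword):
    """Unindexed lookup: test entries in document order until one matches.

    Returns the matching entry (or None) and the number of entries examined.
    """
    examined = 0
    for entry in root.iter("ENTRY"):
        examined += 1
        if entry.findtext("HW") == headword:
            return entry, examined
    return None, examined


print(by_headword is by_position)
print(scan(root, "jaja")[1])
print(scan(root, "word9300")[1])
```

Without an index, a headword near the end of the dictionary forces the scan to examine almost every entry, so the work grows linearly with the size of the dictionary.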