The Provision of Modern Irish Text Corpora
Gearóid Ó Néill, Annette McElligott,
and Richard F. E. Sutcliffe
Department of Computer Science
and Information Systems,
University of Limerick
+353 61 202789 Tel (Ó Néill)
+353 61 202724 Tel (McElligott)
+353 61 202706 Tel (Sutcliffe)
+353 61 330876 Fax
In the last few years the number of text corpora available for language engineering research has increased enormously. However, the range of materials available in Irish remains quite limited. The CURIA (1997) project of the Royal Irish Academy and University College is most enterprising, but it has so far concentrated mainly on early Irish manuscripts, whereas our particular interest is in modern Irish. The aim of this work is to scan a number of modern Irish works and to convert them into an SGML corpus marked up in the TEI Lite Document Type Definition (Burnard and Sperberg-McQueen, 1995) of the Text Encoding Initiative (TEI, 1994). Emphasis will be placed on parallel Irish-English works, which are especially interesting from a language engineering research perspective. Permission has been obtained from a number of publishers to encode their works in this way for research purposes.
A pilot study has been carried out to establish the feasibility of the work, using an extract from A Thig Ná Tit Orm (Ó Sé, 1996). This extract is at present being marked up.
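As a rough indication of the intended encoding, the following Python sketch wraps the sentences of a parallel text in sentence elements carrying language tags. The element and attribute names are meant to follow TEI Lite, but the use of <s> elements and the id scheme pairing Irish and English sentences are our own illustrative assumptions, not the project's final markup.

def tei_sentences(sentences, lang, prefix):
    # Wrap each sentence in an <s> element with a lang attribute and an
    # id; sentence n in each language receives a matching number.
    body = "\n".join('<s id="%s.%d" lang="%s">%s</s>' % (prefix, i + 1, lang, s)
                     for i, s in enumerate(sentences))
    return "<p>\n%s\n</p>" % body

# Hypothetical one-sentence parallel fragment.
irish = ["Sampla d'abairt Ghaeilge."]
english = ["An example of an Irish sentence."]
print(tei_sentences(irish, "ga", "ga"))
print(tei_sentences(english, "en", "en"))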
There are a number of applications of the resulting data. We are especially interested in the following:
- The production of interactive texts intended for language learning purposes,
- Experiments with automatic alignment programs such as align (Gale and Church, 1993); a sketch of the length-based approach appears after this list,
- Work on automatic word sense disambiguation,
- The linking of corpus usages with concept ontologies (e.g. Sutcliffe, Christ, McElligott, Stöckert, Heid and Feldweg, 1997).
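As an illustration of the second item, here is a minimal Python sketch of length-based sentence alignment in the spirit of align (Gale and Church, 1993). The dynamic programme below uses a crude character-length-difference cost with arbitrary fixed penalties for the non-1-1 cases, rather than the published probabilistic model.

def align_lengths(src, tgt):
    # Allowed alignment shapes (sentences consumed on each side) and
    # illustrative fixed penalties for the non-1-1 cases.
    moves = {(1, 1): 0, (1, 0): 450, (0, 1): 450, (2, 1): 440, (1, 2): 440}
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), pen in moves.items():
                if i >= di and j >= dj and cost[i - di][j - dj] < INF:
                    a = sum(len(s) for s in src[i - di:i])
                    b = sum(len(s) for s in tgt[j - dj:j])
                    c = cost[i - di][j - dj] + pen + abs(a - b)
                    if c < cost[i][j]:
                        cost[i][j], back[i][j] = c, (di, dj)
    # Trace back from the final cell to recover the aligned spans.
    pairs, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(pairs))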
References
Burnard, L., & Sperberg-McQueen, C. M. (1995). TEI Lite: An Introduction to Text Encoding for Interchange (Document TEI U 5).
CURIA (1997).
Gale, W. A., & Church, K. W. (1993). A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1), 75-102.
Ó Sé, Maidhc Dainín (1996). A Thig Ná Tit Orm. Baile Átha Cliath, Éire: C. J. Fallon.
Sutcliffe, R. F. E., Christ, O., McElligott, A., Stöckert, C., Heid, U., & Feldweg, H. (1997). Mapping German Word Senses onto a Core Ontology to Create a Multilingual Language Engineering Resource. Proceedings of the AAAI Spring Symposium 'Ontological Engineering', Stanford University, California, 24-26 March, 1997.
TEI (1994). The Text Encoding Initiative.
STARFISH: A Paradigm and Software Standard for Building Experimental Language Engineering Programs
Richard F. E. Sutcliffe
Department of Computer Science
and Information Systems,
University of Limerick
+353 61 202706 Tel
+353 61 330876 Fax
The object of this work is to develop a software standard and development paradigm which allow robust language engineering programs to be developed quickly for experimental purposes. The paradigm comprises software standards for C and Prolog, a standard method for interfacing these to each other, and a convenient interface to the World Wide Web. The rules of the standard are very simple but effective: any conforming routine is guaranteed to work with any other, with no name clashes or other interference. Software can be re-used easily. In addition, a serviceable interface can be created with very little effort, and one which runs on any type of computer. A large number of tools have been created which conform to the paradigm, including interfaces to machine-readable dictionaries, a word stemmer, the WordNet concept ontology, the Link Parser and the Brill Tagger, the Robust Parser for English, a programmable lexical analyser, a sentence recogniser and finally the JUMAN Japanese lexical analyser and part-of-speech tagger. Many of these tools can be viewed in the Demonstration Suite, which may be accessed via the Centre for Language Engineering WWW pages (CLE, 1997).
The key advantage of the paradigm is that all components can work with all others. The task of building a system is therefore reduced to loading the required modules, creating tailored versions of them for the application in question, and carrying out transformations on the output of each component so that it suits the input requirements of the next one in the chain. STARFISH is less sophisticated than, say, GATE (Cunningham, Wilks and Gaizauskas, 1996) but it is more suitable for rapid prototyping. Work is ongoing to complete the software specification and to define the interfaces between components more closely. A suite of HTML processing tools is also being developed.
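STARFISH itself specifies standards for C and Prolog; the following Python-flavoured sketch is only intended to illustrate the chaining idea, with small adapter functions reshaping one component's output into the next component's input. All names in it are illustrative and are not part of the standard.

def tag(text):
    # Stand-in for a part-of-speech tagger: returns (word, tag) pairs.
    return [(w, "nn") for w in text.split()]

def adapt_tags_for_parser(tagged):
    # Adapter: reshape the tagger's output (here, upper-case the tags)
    # to meet the input requirements of the next component.
    return [(w, t.upper()) for w, t in tagged]

def parse(tagged):
    # Stand-in for a parser over tagged input.
    return {"tokens": tagged, "tree": None}

def chain(components, data):
    # Run the data through each component in turn.
    for component in components:
        data = component(data)
    return data

result = chain([tag, adapt_tags_for_parser, parse], "save the file")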
References
CLE (1997). Centre for Language Engineering.
Cunningham, H., Wilks, Y., & Gaizauskas, R. (1996). GATE — a General Architecture for Text Engineering. Proceedings of the 16th Conference on Computational Linguistics (COLING-96).
Development of the Robust Parser
Richard F. E. Sutcliffe, Annette McElligott
Department of Computer Science
and Information Systems,
University of Limerick
+353 61 202724 Tel (McElligott)
+353 61 202706 Tel (Sutcliffe)
+353 61 330876 Fax
Following the workshop Industrial Parsing of Software Manuals held in 1995, at which a number of different parsers were evaluated using a corpus of software manual utterances (Sutcliffe, Koch and McElligott, 1996), the need was recognised for an efficient parser which could extract certain simple constructs from inputs of arbitrary complexity. A multiple-scan parsing algorithm was developed to meet this need. The input is first tagged for part-of-speech using the Brill Tagger. The parser then scans the input repeatedly, looking for certain constructs and replacing instances of them by suitable non-terminal symbols. At present the syntactic scans look for noun phrases, prepositional phrases and verb groups. Effectively, the desired constructs are being defined in a hierarchy of context-free grammars in which each level can refer both to terminal symbols directly and to any non-terminal symbols defined by grammars beneath it in the hierarchy. The parser is efficient because backtracking is minimised. One advantage of this approach is that it allows two-pass lexical analysis, heuristic-based sentence recognition, compound recognition, parsing, predicate-argument recognition and semantic case frame extraction to be integrated within the same computational framework, which is convenient in engineering applications. The output from the parser is a version of the original input in which contiguous groups of words have been transformed into parse trees. Here is an example:
[cvg(and,[save]),
cnp(and,[np(cd(?,?),cn(and,[file]))]),
cpp(and,[pp(p(and,[under]),cnp(and,[np(cd(?,?),cn(and,[name]))]))]),
cpp(and,[pp(p(and,[with]),cnp(and,[np(cd(?,?),cn(and,[extension]))]))]),[]]
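The multiple-scan idea itself can be conveyed by the following minimal Python sketch, which operates on Brill-style tag sequences. Each scan rewrites a recognised construct as a single non-terminal so that later scans can refer to it; the three patterns below are simplified illustrations, not the parser's actual grammars.

import re

SCANS = [
    ("NP", re.compile(r"(DT )?(JJ )*(NN )+")),   # noun phrase
    ("PP", re.compile(r"IN (NP )")),             # prepositional phrase
    ("VG", re.compile(r"(MD )?(VB[DGNPZ]? )+")), # verb group
]

def multiscan(tags):
    # tags: a list of POS tags, one per word of the input.
    s = "".join(t + " " for t in tags)
    for symbol, pattern in SCANS:  # one scan per construct
        s = pattern.sub(symbol + " ", s)
    return s.split()

# "save the file under the name" tags as VB DT NN IN DT NN and reduces
# to a verb group, a noun phrase and a prepositional phrase.
print(multiscan(["VB", "DT", "NN", "IN", "DT", "NN"]))  # ['VG', 'NP', 'PP']

Note that the PP pattern refers to the NP non-terminal produced by the earlier scan, mirroring the grammar hierarchy described above.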
A demonstration version is available at the Centre for Language Engineering WWW pages (CLE, 1997). At present work is ongoing to extend the grammatical coverage and to evaluate the accuracy of analysis. In addition we plan to produce versions for German and Japanese.
References
CLE (1997). Centre for Language Engineering.
Sutcliffe, R. F. E., Koch, H.-D., & McElligott, A. (Eds.) (1996). Industrial Parsing of Software Manuals. Amsterdam, The Netherlands: Rodopi.
International WordNet: Experiments in Multilingual Ontologies
Richard F. E. Sutcliffe, Annette McElligott,
Donie O’Sullivan and Gearóid Ó Néill
Department of Computer Science
and Information Systems,
University of Limerick
+353 61 202789 Tel (Ó Néill)
+353 61 202724 Tel (McElligott)
+353 61 202706 Tel (Sutcliffe)
+353 61 330876 Fax
The aim of this work is to establish the extent to which an ontology can be used to represent the meanings of concepts in different natural languages. The ontology being used is the Princeton WordNet, which is based on American English. So far a number of different mapping paradigms have been identified. Essentially, these attempt to map a WordNet sense onto a target language sense, map a target language sense onto a WordNet sense, or use a combination of the two. In all cases the mapping is carried out by asking questions of a population of subjects. Originally, printed questionnaires were used. Next, electronic mail messages were employed. In our most recent work, software has been developed which allows the mapping process to be carried out interactively over the World Wide Web.
Various methods can be used to specify the sense of a word, including dictionary definitions, extracts from ontologies and usages taken from corpora. The most appealing of these for us is the last, as it effectively allows word senses in different corpora, and indeed different languages, to be linked via a core ontology. So far, studies have been carried out in German, Irish and Russian (Sutcliffe, McElligott, O’Sullivan, Polikarpov, Kuzmin and Ó Néill, 1996; Sutcliffe, O’Sullivan, McElligott and Ó Néill, 1996; Sutcliffe, O’Sullivan, Polikarpov, Kuzmin, McElligott and Véronis, 1996; Sutcliffe, Christ, McElligott, Stöckert, Heid and Feldweg, 1997). Two IWN demonstration systems have been created which are available over the World Wide Web. These can be reached via the Centre for Language Engineering pages (CLE, 1997).
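As a concrete picture of what a corpus-based mapping records, the following sketch shows one possible judgement record. The field names, values and the sense identifier are our own hypothetical illustrations, not the actual IWN data format.

# One subject's judgement linking a corpus usage of an Irish word to a
# WordNet sense. All field names and values are hypothetical.
judgement = {
    "wordnet_sense": "bank%1:14:00",               # hypothetical sense key
    "target_language": "ga",
    "target_word": "banc",
    "corpus_usage": "Chuaigh sé go dtí an banc.",  # usage shown to the subject
    "answer": "same_sense",                        # subject's response
    "subject_id": 17,
}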
References
CLE (1997). Centre for Language Engineering.
Sutcliffe, R. F. E., Christ, O., McElligott, A., Stöckert, C., Heid, U., & Feldweg, H. (1997). Mapping German Word Senses onto a Core Ontology to Create a Multilingual Language Engineering Resource. Proceedings of the AAAI Spring Symposium 'Ontological Engineering', Stanford University, California, 24-26 March, 1997.
Sutcliffe, R. F. E., O'Sullivan, D., McElligott, A., & Ó Néill, G. (1996). Irish-English Mappings in International WordNet: A Pilot Study. Proceedings of the International Translation Studies Conference, Dublin City University, 9-11 May, 1996.
Sutcliffe, R. F. E., O'Sullivan, D., Polikarpov, A. A., Kuzmin, L. A., McElligott, A., & Véronis, J. (1996). IWNR - Extending a Public Multilingual Taxonomy to Russian. Proceedings of the Workshop Multilinguality in the Lexicon, AISB Second Tutorial and Workshop Series, University of Sussex, Brighton, UK, 31 March - 2 April 1996, 14-25.
Tools for Japanese Information Retrieval
Richard F. E. Sutcliffe, Donie O’Sullivan,
Diarmuid Hayes and Nao Nashimoto
Department of Computer Science
and Information Systems,
University of Limerick
+353 61 202396 Tel (Nashimoto)
+353 61 202706 Tel (Sutcliffe)
+353 61 330876 Fax
The objective of this project is to investigate tools which are currently available for Japanese language engineering and to build an experimental text retrieval system working in both Japanese and English. At present tools such as EDICT (a machine-readable dictionary), JUMAN (a tagger and segmenter), MULE (an editor) and jLatex (a text formatter) are being installed and evaluated. So far, we have JUMAN working, we can convert between Hiragana/Katakana and Roma-ji, and we can work with EDICT. Essentially this means that a Japanese string (which might contain Hiragana, Katakana, Kanji and Roma-ji) can be converted into a list of lexemes, the part-of-speech of each can be determined, the pronunciation of any Kanji words can be found, and the meanings of all words can be looked up in the dictionary.
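To give a flavour of the dictionary lookup step, here is a minimal Python sketch of parsing one line of EDICT, whose entries have the form KANJI [KANA] /gloss/gloss/ (the bracketed kana reading is absent for words written entirely in kana). The sample entry and the function name are illustrative.

import re

# headword, optional bracketed kana reading, then slash-separated glosses
ENTRY = re.compile(r"^(\S+)(?: \[(\S+)\])? /(.+)/$")

def parse_edict_line(line):
    # Split one EDICT line into headword, reading and English glosses.
    m = ENTRY.match(line.rstrip())
    if not m:
        return None
    word, kana, glosses = m.groups()
    return {"word": word, "reading": kana or word,
            "glosses": glosses.split("/")}

print(parse_edict_line("辞書 [じしょ] /dictionary/lexicon/"))
# {'word': '辞書', 'reading': 'じしょ', 'glosses': ['dictionary', 'lexicon']}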
We are working within the STARFISH paradigm, which means that a WWW browser can be used for display. The latest versions of such browsers can display Japanese and ISO-Latin-1 characters in the same frame without difficulty.
The next steps in this work will be to build a simple text-retrieval engine and to develop a Japanese version of the Robust Parser. We would also like to map a set of Japanese word senses onto WordNet using the IWN paradigm (e.g. Sutcliffe, McElligott, O’Sullivan, Polikarpov, Kuzmin, Ó Néill and Véronis, 1996) so that a conceptual retrieval system can be built. This will allow us to extend the SIFT system (Sutcliffe, Boersma, Bon, Donker, Ferris, Hellwig, Hyland, Koch, Masereeuw, McElligott, O’Sullivan, Relihan, Serail, Schmidt, Sheahan, Slater, Visser and Vossen, 1995) so that it can work with queries and texts in both Japanese and English.
References
Lunde, K. (1993). Understanding Japanese Information Processing. Sebastopol, CA: O'Reilly and Associates, Inc.
Sutcliffe, R. F. E., Boersma, P., Bon, A., Donker, T., Ferris, M. C., Hellwig, P., Hyland, P., Koch, H.-D., Masereeuw, P., McElligott, A., O'Sullivan, D., Relihan, L., Serail, I., Schmidt, I., Sheahan, L., Slater, B., Visser, H., & Vossen, P. (1995). Beyond Keywords: Accurate Retrieval from Full Text Documents. Proceedings of the 2nd Language Engineering Convention, Queen Elizabeth II Conference Centre, London, UK, 16-18 October 1995.
Sutcliffe, R. F. E., McElligott, A., O'Sullivan, D., Polikarpov, A. A., Kuzmin, L. A., Ó Néill, G., & Véronis, J. (1996). An Interactive Approach to the Creation of a Multilingual Concept Ontology for Language Engineering. Proceedings of the Workshop 'Multilinguality in the Software Industry', European Conference on Artificial Intelligence, ECAI'96, Budapest University of Economics, Budapest, Hungary, 12 August 1996.