Interval / Title page

The Interval Interchange format

LE Reference / LE2-4002
Origin / Author / TR / Cornelis van der Laan
WP / Task / WP4 / T41
Task Responsible / TR
Distribution / CEC / Partners / public
Status / internal draft / circulated draft / final
Doc. validated by / Contributors / Partners / UC / SC / TEC
Print date / 24/06/96
Nbr pages / 15
File name / T41-0003.DOC
Revision nbr / 2

Cornelis van der Laan

Table of contents

1. Introduction

1.1 Purpose

1.2 Structure

1.3 Revision history

2. Databases and Information Interchange

2.1 Terminological Databases

2.2 Inconsistencies in Termbases

2.3 The Consolidation Tool

2.4 Information Interchange

3. The Interval Interchange Format

3.1 File Header

3.2 File Body

3.2.1 Concept Attributes

3.2.2 Term Fields

3.2.3 Term Attributes

3.3 Example Entry

4. Relation to other Interchange Formats

5. Bibliography

1. Introduction

1.1 Purpose

This document describes the Interval Interchange Format v1.1 (IIF). IIF is the transfer format for terminological databases between different vendor termbase systems and the INTERVAL Consolidation Tool.

1.2 Structure

Section 2 first introduces termbase management systems and the functionality of the INTERVAL Consolidation Tool, which both determine requirements for the INTERVAL Interchange Format. In Section 3 the IIF is described in detail. A comparison with other exchange formats is given in Section 4.

1.3 Revision history

  • Version 1.0 (Cornelis van der Laan / TR)
  • Version 2.0 (Pekka Kähkipuro / WS). Minor changes, mainly to make the IIF compatible with the proposed DTD presented in the Interval document T410004.DOC.

2. Databases and Information Interchange

Terminological databases manage information about concepts and the terms used to express them in various languages. They provide a source of reference to translators and casual users needing access to linguistic information, and may serve as machine readable lexicons for machine translation systems.

Different database systems exist today and many termbases have been created with their help. The INTERVAL project addresses the consolidation of single termbases and the merging and consolidation of multiple termbases into one enhanced and more valuable termbase.

2.1 Terminological Databases

Terminology Database Management Systems (TDMS) manage database entries describing all terminologically relevant information about concepts. “A concept is a unit of thought constituted through abstraction on the basis of properties common to a set of objects” [ISO 1087]. The common properties of concepts usually include the following:

  • The domain a term belongs to.
  • A list of terms or phrases, possibly in several languages, that are used to linguistically express the concept.
  • The word class of the terms and phrases.
  • Definitions of the term(s) to distinguish the term from other terms.
  • Additional notes, examples, contexts to clarify the meaning of terms and their usage.

The internal storage structures a TDMS uses to store and manage this information are irrelevant and always hidden from the user.

More relevant to the user are the search, display and interchange facilities a specific TDMS offers. To be useful, a system must index all terms and sub-parts of terms, allow for flexible and error-tolerant retrieval and be able to easily exchange information with other applications and the user:

Indexing: Terms must be indexed in a language-dependent way. Accented characters in a language must be mapped to their base character to ensure the correct alphabetic sort order. Dead characters must be suppressed.

Searching: Error-tolerant search facilities must be provided that allow for retrieval of terms in case of inexact queries. Users often misspell queries or try phonetic transcriptions. In all cases, solutions must be found or the database will be useless.

Display: Entries must be presented to the user in an easy-to-survey format. All relevant information should be presented together without the need to assemble it by hand. Navigation between cross-referenced entries must be easy.

Exchange: Systems must be able to exchange information with other applications. This might be a user’s word processor, in which case only parts of single entries have to be exchanged, or it might be another TDMS, in which case whole databases have to be converted to an information-preserving and safely transmissible interchange format.
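The indexing requirement can be illustrated with a small Python sketch. It is only an assumption of how accent folding might be done, not a description of any particular TDMS: the function name index_key is hypothetical, and the sketch ignores language-specific rules (in Swedish, for instance, å, ä and ö sort as separate letters rather than as variants of a and o).

```python
import unicodedata


def index_key(term: str) -> str:
    """Return a hypothetical index key: accents removed, case folded."""
    # NFD splits an accented character into base character + combining mark.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks so that "é" is indexed under "e".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


terms = ["décodage", "decodeur", "Écho", "echo"]
sorted_terms = sorted(terms, key=index_key)
```

Sorting with such a key places accented and unaccented spellings next to each other instead of pushing accented terms to the end of the index.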

2.2 Inconsistencies in Termbases

Many termbases exist today, ranging from monolingual glossaries through bilingual translation dictionaries to full-blown multilingual resources containing exhaustive definitory information.

Termbases may suffer from various deficiencies, including the following:

  • Termbases may be incomplete because they contain only monolingual or bilingual entries.
  • They may be incomplete because not every entry contains fields regarded as necessary for the termbase, e.g., definitions or source references.
  • They may contain duplicate terms in one or more languages.
  • They may contain dummy terms introduced during the creation of concept entries, which have not been replaced with the actual terms.
  • Cross-references between entries may be invalid because the referenced element is missing or belongs to another domain.
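As an illustration of one such deficiency, the following Python sketch looks for terms that occur in more than one entry of the same language. It is not the CT's consolidation method, which is described in [IV-ConsMeth]; the data layout (a dictionary from entry id to a list of language/term pairs) and the function name find_duplicate_terms are assumptions made for this example only.

```python
from collections import defaultdict


def find_duplicate_terms(entries):
    """Return {(language, folded term): [entry ids]} for terms seen more than once."""
    seen = defaultdict(list)
    for entry_id, terms in entries.items():
        for lang, term in terms:
            # Compare case-insensitively; a real tool would need stricter rules.
            seen[(lang, term.casefold())].append(entry_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Hypothetical sample data, not taken from any real termbase.
entries = {
    "e1": [("EN", "decoder"), ("FR", "décodeur")],
    "e2": [("EN", "Decoder")],
    "e3": [("FR", "récepteur")],
}
duplicates = find_duplicate_terms(entries)
```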

The INTERVAL Consolidation Tool is devised as a means to detect and possibly resolve these inconsistencies. In general, the following information is deemed relevant for consolidation:

  • The domain of a concept
  • The terms naming the concept
  • The definitions of terms
  • Additional notes and contextual information
  • Source references of terms, definitions and notes
  • Administrative data

The Interval Interchange Format is devised to represent and transfer all this information. It will be fully described in section 3.

2.3 The Consolidation Tool

The INTERVAL Consolidation Tool (CT, for short) is conceived as a tool to help terminologists consolidate single or multiple termbases. The CT is a stand-alone tool and thus independent of existing TDMS. Its purpose is not to supersede existing TDMS, but to endow them with the functionality of consolidation.

The CT uses its own database engine to store, index, and consolidate termbases. Input termbases are accepted in the INTERVAL Interchange Format. The consolidation process generates a list of entries to be merged. In the first phase of the project, this list will be output in a textual format that can be used to look up entries in the vendor termbase and to consolidate them by hand. In a later stage of the project, the CT will use interprocess communication with vendor termbases for automatic display and merging of entries.

See [IV-ConsMeth] for a description of the INTERVAL consolidation methodology and the set of consolidation rules to be implemented in the CT.

2.4 Information Interchange

Exchanging information is a complicated and error-prone process. The goal is to correctly and reliably transmit information from one software system on one computer to another software system on another computer by means of an unreliable information transmission channel. Several steps are involved in this process, each of which may be subject to errors:

Generation: The source data has to be converted from software-internal data structures to a linear octet sequence. This conversion must be information preserving, otherwise the receiver will not be able to restore an equivalent of the data source. Structured information like trees or graphs must be linearized accordingly. Machine-dependent byte ordering must be resolved.

Transmission: The generated octet stream must be safely transmitted through an unreliable transmission channel[1]. This may involve encoding the 8-bit octet stream in 7-bit US-ASCII format (or a subset thereof), since the stream may pass channels or machines that are not ASCII-based (e.g., EBCDIC).

Interpretation: The receiver must interpret the byte stream, decode any previous character encodings and rebuild data structures equivalent to the source database.
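The three steps can be sketched in Python as a round trip. This is an illustration under stated assumptions: this document does not prescribe a 7-bit transfer encoding, so Base64 is used here only as one well-known possibility, and the function names generate, to_seven_bit and interpret are hypothetical.

```python
import base64


def generate(text: str, charset: str) -> bytes:
    """Generation: turn internal text into a linear octet sequence."""
    return text.encode(charset)


def to_seven_bit(octets: bytes) -> bytes:
    """Transmission: wrap 8-bit octets in 7-bit ASCII (Base64 is an assumption)."""
    return base64.b64encode(octets)


def interpret(payload: bytes, charset: str) -> str:
    """Interpretation: undo the transfer encoding, then the character encoding."""
    return base64.b64decode(payload).decode(charset)


source = "σύστημα"
payload = to_seven_bit(generate(source, "iso-8859-7"))
restored = interpret(payload, "iso-8859-7")
```

The round trip is information preserving only if sender and receiver agree on the character set, which is why the IIF header described in Section 3.1 states it explicitly.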

For reliable information exchange, several formats have been proposed and standardized by international committees. The common ground is laid by SGML, from which TIF and the newly proposed MARTIF are derived (see section 4).

Within the INTERVAL project, information interchange is needed between different termbase systems and the stand-alone Consolidation Tool. Because of the special purpose of the CT, the IIF is not a general-purpose exchange format like TIF but is determined by pragmatic considerations:

  • Generating IIF out of existing termbase systems or termbases in textual formats must be easy.
  • Interpretation of IIF must be easy for the Consolidation Tool.
  • IIF need not represent all information stored in termbases but only the information needed for the consolidation task.

For multilingual databases comprising not only languages using Western character sets (US-ASCII or ISO-8859-1), the question of character encoding arises. Most of the termbases the INTERVAL project is concerned with use at most two different character sets, namely ISO-8859-1 and ISO-8859-7 for Greek. To allow for more languages and to avoid unnecessarily restricting the general applicability of the Consolidation Tool, the CT accepts source bases in mixed-mode encodings as well as in the general UNICODE encoding.

3. The Interval Interchange Format

IIF is a special purpose, SGML-based interchange format. It is not intended to replace TIF or any other similarly general and expressive format. Due to the considerations stated at the end of section 2.4, the equivalence condition of the transmission procedure has been sacrificed in order to allow for easy use within the INTERVAL project.

IIF thus does not contain a superset of all possible tags used in existing termbases. It is restricted to a fixed set of tags covering the tags most commonly found in all termbases: domain information, language tags, definitions, and source references. The fixed tag names allow a fixed interpretation of the fields’ contents, thus avoiding the need for a general and complex entry representation in the CT.

IIF documents are line-oriented, fully-tagged SGML documents consisting of a header with descriptive information, and a body containing one or more terminological entries. An IIF document always starts with the <IIF> tag and ends with the </IIF> tag. The following sections describe the structure of the header and the body.

3.1 File Header

Each input stream to the Consolidation Tool starts with a header. The header specifies the transfer encoding used for the stream body, which may be either UNICODE (for the whole body including the tags) or different ISO-8859 encodings for individual language fields. If ISO codes are used, they must be specified for each language field in the body.

The header is enclosed by <Header>…</Header> tags[2] and is always encoded in ISO-8859-1. Allowed header fields are given in the following table:

Tag name / Possible values
<Encoding> / UNICODE
<Encoding lang="DE">, <Encoding lang="EN">, etc. for all language abbreviations / ISO-8859-1 to ISO-8859-9

A language-specific encoding tag affects entries in the specified language only. An encoding tag without a language abbreviation affects all other languages. The following language abbreviations have been defined:

Language abbreviation / Description
CA / Catalan term
DA / Danish term
EN / English term
EL / Greek term
ES / Spanish term
FI / Finnish term
FR / French term
IT / Italian term
LA / Latin term
NL / Dutch term
PT / Portuguese term
SV / Swedish term

The list of language abbreviations may be extended as needed. By default, if a termbase is not transmitted in UNICODE, the character set used for encoding is ISO-8859-7 for <EL> and ISO-8859-1 for all others. Language switching is performed automatically by the CT.
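The lookup order described above can be sketched in Python as follows. The function name resolve_encoding and the representation of the header as a dictionary (with the key None standing for an <Encoding> tag without a language) are assumptions of this sketch, not part of IIF; the precedence shown (language-specific tag, then general tag, then the defaults) is one reading of the rules above.

```python
def resolve_encoding(lang, header_encodings):
    """Pick the character set for a field in language `lang`.

    `header_encodings` maps a language abbreviation (or None for a tag
    without language) to the value given in the header.
    """
    if lang in header_encodings:      # language-specific <Encoding lang="..">
        return header_encodings[lang]
    if None in header_encodings:      # <Encoding> without a language
        return header_encodings[None]
    # Defaults stated in the text: Greek in ISO-8859-7, all others in ISO-8859-1.
    return "ISO-8859-7" if lang == "EL" else "ISO-8859-1"


raw = "οδικών".encode("iso-8859-7")            # octets as they would be transmitted
greek = raw.decode(resolve_encoding("EL", {}))
wrong = raw.decode(resolve_encoding("FR", {}))  # same octets, wrong language tag
```

Decoding with the wrong character set does not raise an error; it silently yields unreadable text, which is why every language field must be resolvable to exactly one encoding.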

The header might contain additional conversion-specific administrative information. This information is read and preserved by the CT, but ignored during the consolidation process.

Example: A header for a termbase containing German, French and Greek entries transmitted using their respective character sets might look as follows:

<Header>

<Encoding lang="DE">ISO-8859-1</Encoding>

<Encoding lang="FR">ISO-8859-1</Encoding>

<Encoding lang="EL">ISO-8859-7</Encoding>

</Header>

Example: This header specifies that the termbase is transmitted in UNICODE.

<Header>

<Encoding>UNICODE</Encoding>

</Header>
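A reader for headers like the two examples above could be sketched in Python as shown below. This is a minimal illustration using a regular expression rather than a real SGML parser; the names ENCODING_TAG and parse_header are hypothetical, and tag names are matched case-insensitively as stated in footnote [2].

```python
import re

# Matches <Encoding>..</Encoding> with an optional lang="XX" attribute.
ENCODING_TAG = re.compile(
    r'<encoding(?:\s+lang\s*=\s*"([A-Za-z]+)")?\s*>\s*([^<]*?)\s*</encoding>',
    re.IGNORECASE,
)


def parse_header(header_text: str) -> dict:
    """Map language abbreviation (None for the general tag) to its encoding."""
    encodings = {}
    for lang, value in ENCODING_TAG.findall(header_text):
        # An absent lang attribute is reported as an empty string.
        encodings[lang.upper() or None] = value.upper()
    return encodings


mixed = parse_header(
    '<Header>\n'
    '<Encoding lang="DE">ISO-8859-1</Encoding>\n'
    '<Encoding lang="EL">ISO-8859-7</Encoding>\n'
    '</Header>'
)
unicode_only = parse_header("<header><encoding>UNICODE</encoding></header>")
```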

3.2 File Body

The file body contains a list of the concept entries making up the termbase. Each entry is enclosed by <Entry>...</Entry> tags. Each concept entry contains global concept attributes and a list of terms in different languages, each of which can have term attributes.

3.2.1 Concept Attributes

Every concept entry can be annotated with one or more concept attributes. If concept attributes are present, they are enclosed by the <Concept>...</Concept> tags. A concept may have the following attributes:

Tag name / Description
<Project> / The project or (sub-)database an entry belongs to.
<Subject> / The domain(s) of the entry.
<ConRef> / Reference of the source document that first mentioned a concept.
<Comment> / Additional comments to the entry. These might be automatically generated notes. (Ignored during consolidation).

List-structured values for a concept attribute (such as multiple subjects) are specified by repeating the tag, with one list item per tag.
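The following Python sketch shows how repeated tags might be collected into lists on the reading side. It is an illustration only: the function name concept_attributes is hypothetical, the function expects the text between <Concept> and </Concept> (not the enclosing tags themselves), and the second subject value in the usage example is invented.

```python
import re
from collections import defaultdict

# A simple field: <Tag>content</Tag>, tag names compared case-insensitively.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def concept_attributes(concept_text: str) -> dict:
    """Collect every attribute into a list, one item per occurrence of the tag."""
    values = defaultdict(list)
    for tag, content in FIELD.findall(concept_text):
        values[tag.lower()].append(content.strip())
    return dict(values)


attributes = concept_attributes(
    "<Project>INTERVAL</Project>\n"
    "<Subject>telecommunication</Subject>\n"
    "<subject>radio</SUBJECT>\n"
)
```

Storing every attribute as a list, even when it occurs once, keeps single-valued and multi-valued attributes uniform for later processing.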

3.2.2 Term Fields

The global attributes of an entry are followed by the list of the term(s) used to express the concept in one or more languages. Each term and its related term attributes (see 3.2.3) are enclosed by the <Term lang="XX">...</Term> tags or by the <Equi lang="XX">...</Equi> tags, where XX is an abbreviation for the term's language (see 3.1). Each language may occur multiple times per concept (or not at all).

The tags <Term lang="XX">...</Term> mark a primary term for the concept in the given language. This means that the CT may rely on it during consolidation. If the term is secondary and only an equivalent for the primary term (e.g. it has not been validated for two-way translations), the tags <Equi lang="XX">...</Equi> are used instead. In most cases, a concept should have at least one primary term. If a termbase does not distinguish between primary and secondary terms, all terms are considered primary.

The actual term string follows the <Term...> or <Equi...> tag and is enclosed by the <TermStr>...</TermStr> tags. After the term string, there may be any number of additional term attributes before the ending tag </Term> or </Equi>.
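To make the distinction between primary and secondary terms concrete, the following Python sketch extracts the terms of one entry together with their language and status. It is a regular-expression illustration, not a full SGML parser and not the CT's reader; the names TERM, TERM_STR and list_terms are hypothetical.

```python
import re

# <Term lang="XX">..</Term> or <Equi lang="XX">..</Equi>, case-insensitive.
TERM = re.compile(
    r'<(term|equi)\s+lang\s*=\s*"([A-Za-z]+)"\s*>(.*?)</\1\s*>',
    re.IGNORECASE | re.DOTALL,
)
TERM_STR = re.compile(r"<termstr>(.*?)</termstr>", re.IGNORECASE | re.DOTALL)


def list_terms(entry_text: str) -> list:
    """Return one record per term: primary flag, language, term string."""
    terms = []
    for kind, lang, body in TERM.findall(entry_text):
        match = TERM_STR.search(body)
        terms.append({
            "primary": kind.lower() == "term",   # <Equi> marks a secondary term
            "lang": lang.upper(),
            "term": match.group(1).strip() if match else None,
        })
    return terms


entry = (
    '<Entry>'
    '<Term lang="FR"><TermStr>système de décodage</TermStr><TermTyp>Term</TermTyp></Term>'
    '<Equi lang="el"><TermStr>σύστημα</TermStr></Equi>'
    '</Entry>'
)
terms = list_terms(entry)
```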

3.2.3 Term Attributes

Each term can be associated with grammatical, definitory and contextual information. Each of the following tags is considered a text field for the purpose of consolidation and can occur zero or more times, unless otherwise noted.

Tag name / Description
<Grammar> / Grammatical information of the term (n/v/adj/adv). May occur at most once per term.
<TermTyp> / The type of a term (Term, Abbreviation or Phrase). May occur at most once per term.
<TermRef> / The reference of the term.
<Def> / The definition of the term.
<DefRef> / The reference of the definition.
<Note> / Additional notes or comments to the term.
<NoteRef> / The reference of the note.
<Context> / Contextual information for describing the usage of the term.
<CtxtRef> / The reference of the context.
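The occurrence rules in the table can be checked mechanically. The Python sketch below is one possible check, written for illustration: the names AT_MOST_ONCE, REPEATABLE and check_term_attributes are hypothetical, the input is assumed to be the text between the <Term...> or <Equi...> tag and its ending tag, and field contents are assumed not to contain a literal "<".

```python
import re
from collections import Counter

AT_MOST_ONCE = {"grammar", "termtyp"}
REPEATABLE = {"termref", "def", "defref", "note", "noteref", "context", "ctxtref"}
OPEN_TAG = re.compile(r"<(\w+)[^>]*>")   # opening tags only; "</..>" does not match


def check_term_attributes(term_body: str) -> list:
    """Return a list of problems found in the attributes of one term."""
    counts = Counter(tag.lower() for tag in OPEN_TAG.findall(term_body))
    problems = []
    for tag, count in counts.items():
        if tag == "termstr":             # the term string itself, not an attribute
            continue
        if tag not in AT_MOST_ONCE | REPEATABLE:
            problems.append(f"<{tag}> is not a term attribute")
        elif tag in AT_MOST_ONCE and count > 1:
            problems.append(f"<{tag}> occurs {count} times")
    return problems


valid = "<TermStr>decoder</TermStr><TermTyp>Term</TermTyp><Def>a</Def><Def>b</Def>"
invalid = "<TermStr>decoder</TermStr><TermTyp>Term</TermTyp><TermTyp>Phrase</TermTyp>"
```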

3.3 Example Entry

The following example shows an IIF encoding of a trilingual entry. The entry contains terms in French, Spanish and Greek, the latter being encoded in ISO-8859-7.

<IIF>

<header>

<encoding lang="FR">ISO-8859-1</encoding>

<encoding lang="ES">ISO-8859-1</encoding>

<encoding lang="EL">ISO-8859-7</encoding>

</header>

<Entry>

<Concept>

<Project>INTERVAL</Project>

<Subject>telecommunication</Subject>

<ConRef>unknown</ConRef>

<Comment>OriginalEntry:110575</Comment>

</Concept>

<Term lang="FR">

<TermStr>système de décodage d'informations routières</TermStr>

<TermTyp>Term</TermTyp>

<TermRef>unknown</TermRef>

<Def>Système permettant l'identification des programmes,
le choix de la meilleure fréquence alternative, l'affichage
des programmes reçus ainsi que l'interruption de programme
pour des informations de radioguidage</Def>

<DefRef>Echo 23.3.89</DefRef>

</Term>

<Equi lang="ES">

<TermStr>sistema de codificación de información viaria</TermStr>

<TermTyp>Term</TermTyp>

<TermRef>unknown</TermRef>

</Equi>

<Equi lang="EL">

<TermStr>σύστημα αποκωδικοποίησης πληροφοριών</TermStr>

<TermTyp>Term</TermTyp>

<TermRef>unknown</TermRef>

<Note>οδικών</Note>

</Equi>

</Entry>

</IIF>

4. Relation to other Interchange Formats

Why devise yet another interchange format instead of using an existing one?

TIF and the recently proposed MARTIF are general-purpose interchange formats for terminological databases and offer a huge set of tags for all kinds of information that might be stored in termbases. They also allow structured tags. The Consolidation Tool needs none of this generality, and supporting it would make generating and interpreting the format unnecessarily complicated.

5. Bibliography

[ISO 1087] / Terminology Vocabulary. ISO 1087:1990 (E/F).
[IV-ConsMeth] / Consolidation Methodology. INTERVAL internal draft.

[1] Physical transmission problems are ignored in this document. They must be encapsulated by hardware layers and appropriate software protocols, providing a reliable channel to the application.

[2] All tags are case-insensitive.