The Linked TEI: Text Encoding in the Web

The TEI Conference and Members' Meeting 2013 took place at the DigiLab Centro interdipartimentale di ricerca e servizi (interdepartmental centre for research and services) at Sapienza University in Rome from 2nd to 5th October 2013. The following is a summary of the papers attended:

Wednesday, Opening keynote

"Text encoding, ontologies, and the future", Allen Renear

Allen Renear, a professor at the Graduate School of Library and Information Science (GSLIS), University of Illinois, presented on his work in the TEI community. He described his progression as moving from descriptive markup to OHCO (the "ordered hierarchy of content objects" model), to XML semantics, to ontologies and scientific publishing, and most recently to conceptual foundations. Conceptual foundations are the philosophical and intellectual underpinnings of much of the information-related work done in the humanities and sciences. The need for increased precision in our conceptual modelling will require thorough and precise definitions of many concepts we use in everyday language today, such as "information", "file", or "publication". What does it mean, for example, to have the "same data" in different representations, e.g. relational tables or RDF, serialized in XML or N3, and encoded in UTF-8 or UTF-16? There has been little discussion or reflection in our community on these issues.

As a study by the presenter revealed, there seems to be as much confusion about the term "dataset" as there is about "text" or "document". Scientists defined datasets with a focus either on grouping (set, aggregation, collection), on content, on relatedness, or on purpose. It is therefore impossible to define "dataset" precisely as a single concept, and its colloquial use must be replaced by a family of new, more specific concepts. FRBR is an example of such an attempt at precision: it replaces the imprecise concept "book" with "work", "expression", "manifestation", and "item", the so-called group 1 entities. Colloquially, for book-like objects, we can think of work, text, edition, and copy. These definitions form an interlocking cascade: every entity type is assigned a set of unique attributes, and these attribute assignments are disjoint (mutually exclusive).

Can we adopt FRBR as an ontology for datasets, texts, etc.? Any form of communication involves propositional content; propositions can serve as content, as attitudes, and as the bearers of the message. Identical propositions can be expressed in different writing or speech systems and can therefore be encoded very differently. Is it possible to align proposition with work, expression with sentence, manifestation with encoding, and item with inscription? This would be a plausible ontology for any type of expression. However, there are two problems: there are many levels of encoding, and each one is a type of "symbolic notation" and therefore similar to an expression. We find that in FRBR both expression and manifestation are symbol structures, not distinct entities; it is the context of the symbol structure that makes it an expression or a manifestation. It is therefore helpful to distinguish between roles and types: "person" is a type, "student" is a role; a role can end without making the person cease to exist, a type cannot. We can refactor FRBR to represent propositional content (work), symbol structures at the syntax and encoding levels (expression/manifestation), and inscription, or patterned matter and energy (item). The syntactic expressions and encodings could then be repeated almost indefinitely and allow for very precise definitions of any type of expression, such as in a dataset. The contexts for these operations, or interpretive frames, remain inert in the person encoding the proposition: a community with a set of intentions, actions, and behaviours is required to contextualize these expressions and make sense of them.
In practical terms, the model demonstrated (the Systematic Assertion Model) is currently being applied in the development of a preservation environment as part of a funded project. Ontological foundations such as the ones described here can provide the needed precision when discussing information objects, but they will also replace a familiar world of idiom and metaphor with a stranger world of unfamiliar terminology.

Thursday presentations

"Modelling frequency data: methodological considerations on the relationship between dictionaries and corpora", Karlheinz Moerth, Gerhard Budin, Laurent Romary

The work presented stems from two interlinked projects of the Austrian Academy of Sciences and the University of Vienna. The Vienna Corpus of Arabic Varieties (VICAV) allows for the comparative study of Arabic dialects and pools linguistic research data such as language profiles, dictionaries, glossaries, corpora, and bibliographies. The Linguistic Dynamics in the Greater Tunis Area project is a three-year funded project that aims to create a corpus of spoken youth language and to compile a diachronic dictionary of Tunisian Arabic. Both projects employ TEI P5 for both the dictionaries and the corpora. The main interest is how best to link the dictionaries to the text corpora in order to retrieve statistical and frequency information. A corpus is a reorganization of the underlying primary text sources and a way to group them in meaningful ways; documenting the underlying sources is an important part of corpus creation. But there is an ontological collision: what is a corpus with regard to linguistic data? Is the dictionary that references the source materials itself the corpus? For the statistical analysis of corpora, the process of tokenization is crucial: there needs to be agreement on what counts as a token in a dictionary and what counts as a lemma. The total numbers of both are necessary for contextualizing the corpora and for making the frequency information explicit and meaningful to users. These frequencies and ranks are expressed in TEI by the <extent> and <measure> elements. Any query of the data in the dictionary and the corpus needs to be aware of the deep internal lexicon structure to fully exploit the diplomatically encoded and manually added information about individual entries and the lexicon as a whole. The relevant parts of a dictionary entry are the lemma level on the one hand and inflected word forms, collocations, multi-word units, and particular senses on the other. How do we register these particular items? The key components of frequency/statistical information are the location of the annotation, its value, rank, provenance, the retrieval method used, the query type, and an evaluation; persistent identifiers are needed for all of these tasks. There are three possible approaches to recording this information: catch-all elements, feature structures, or a TEI customization; so far only the feature-structure approach has been implemented (a sketch of what such markup might look like is given below). Further steps include starting the schema customization process and discussing it in the community, to possibly arrive at some agreements and ideally a new section in the TEI Guidelines.
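The paper's own markup is not reproduced in this summary, but a minimal sketch of the feature-structure approach, as I understood it, might look as follows. The corpus totals, the entry, the lemma, and all attribute values are illustrative assumptions rather than the projects' actual data, and placing an <fs> directly inside an <entry> may itself require the kind of TEI customization the authors propose.

```xml
<!-- Corpus-level totals in the teiHeader, contextualizing the frequencies -->
<extent>
  <measure unit="tokens" quantity="250000"/>
  <measure unit="lemmas" quantity="18000"/>
</extent>

<!-- Hypothetical dictionary entry carrying frequency information
     as a TEI feature structure (the approach implemented so far) -->
<entry xml:id="entry_ktib">
  <form type="lemma"><orth>ktib</orth></form>
  <fs type="corpusFrequency">
    <f name="value"><numeric value="412"/></f>
    <f name="rank"><numeric value="87"/></f>
    <f name="source"><string>spoken youth-language corpus (illustrative)</string></f>
    <f name="queryType"><symbol value="lemma"/></f>
    <f name="retrievedOn"><string>2013-09-15</string></f>
  </fs>
</entry>
```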

"A Saussurean approach to graphemes declaration in charDecl for manuscript encoding", Paolo Monella

This talk focuses on how best to declare writing systems in the teiHeader for the description of medieval and ancient manuscripts. The main issue is that the TEI's Unicode-compliance principle is not sufficient to define graphemes in pre-print writing systems. Slight differences between the alphabets of different manuscripts, e.g. "u" alone vs. "u"/"v", or different uses of punctuation, are a real issue in the description of writing systems, as the same Unicode code point can be used differently in different writing systems. We need to compare writing systems not only on the basis of Unicode but in ways that take these subtle differences into account. In Unicode, "u" is "u"; in a particular writing system, "u" can stand for "u" or "v", so it is vital that its use within that writing system is made explicit in every instance. Corpus-wide normalizations are too inexact to be useful beyond the most simplistic task of making texts portable and searchable. What is needed is the possibility to record a complete representation of the writing system of any manuscript; therefore every single character or glyph needs an entry in the charDecl. Every character needs a mapping to Unicode and information about its expression and content in the linguistic (Saussurean) sense. For this approach we can no longer rely on "a" simply being a Unicode character; instead it should be possible, for those who want to, to reference the character declaration for every character in a manuscript. There is a huge interoperability issue with this approach; however, it can be resolved by basing search and retrieval not on the text in the body but on the mappings between the writing systems of different manuscripts. In this way we can truly "normalize" texts without compromising the authenticity of individual text-bearing objects. The point is not to introduce new glyphs or characters into Unicode, but to make mappings between writing systems that encompass all characters. This is ongoing work on the encoding of ancient and medieval manuscripts that aims to resolve the many complex issues involved, not only the ones presented here but also, for example, the occurrence of several writing systems in one manuscript in the case of multiple scribes.
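As a rough illustration of the proposal (not Monella's actual encoding), a per-manuscript grapheme could be declared with the standard TEI gaiji elements and referenced from the transcription as below; the identifiers, mappings, and the example word are assumptions made for the sketch.

```xml
<teiHeader>
  <encodingDesc>
    <charDecl>
      <!-- One <char> entry per grapheme of this manuscript's writing system -->
      <char xml:id="u">
        <charName>LATIN LETTER U (no u/v distinction in this hand)</charName>
        <!-- "expression" side: the standardized rendering of the grapheme -->
        <mapping type="standardized">u</mapping>
        <!-- "content" side: the values it may stand for in modern orthography -->
        <mapping type="modern">u v</mapping>
        <note>Covers both vocalic and consonantal values.</note>
      </char>
    </charDecl>
  </encodingDesc>
</teiHeader>

<!-- In the transcription, each character can point back to its declaration -->
<w><g ref="#u">u</g>t</w>
```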

Friday presentations

“XQuerying the medieval Dubrovnik”, Neven Jovanovic

The aim of this work is to make the medieval past of Dubrovnik more accessible to researchers. This is done by encoding the most relevant records of the city of Dubrovnik, the Acta. The texts are encoded in TEI/XML and are subsequently queried with XQuery; this talk explained the process of this undertaking. Hundreds of handwritten volumes of council records, mainly in Latin and covering the period from 1301 to 1808, have never been published in their entirety. It has taken the efforts of two learned societies more than 130 years to publish the first 30 volumes of the Acta, and progress remains very slow. The limitations of the printed book meant that many compromises had to be made, with summarizations and omissions of supposedly irrelevant or uninteresting material. All these problems led to the proposal to encode the entire council records. As a pilot project, Vol. 6 of the Acta was encoded, which covers just six years of the period. The aims are to prove that new knowledge can be generated from the combination of text and markup, to produce a complete electronic edition of the Acta, to enable users to create their own versions, to investigate analytical processes over the richly encoded sources, and to open the texts up to encourage use and re-use. The pilot volume has been published online as a Bitbucket repository, where the encoding principles and decisions are explained. The documentation contains many examples of the encoding that make the editorial decisions transparent. Places, persons, and important events are all marked up, as are measures of time, space, and currency. The project is not innovative in its use of TEI, but it aims to be novel from a historiographer's perspective: facts are recorded also in relation to their linguistic expression, various problems of historiography are addressed, and all project outcomes will be published for users to review and re-use. This should make any findings reproducible and understandable. BaseX is the native XML database used in the project, and it allows XQueries to be copied and pasted directly from the Bitbucket repository into the database; these range from small queries that retrieve a list of items to queries that return entire HTML pages (a sketch of such a query is given below). XQuery has a steep learning curve, but the results promise to open up a resource that has been under-researched and deserves more public and scholarly attention.
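The repository's actual queries are not quoted here, but a small query of the kind described, run in BaseX against a database of the encoded volume, might look like the following. The database name "acta6" and the reliance on plain persName markup are assumptions for the sketch; XQuery 3.0 grouping is used, which BaseX supports.

```xquery
(: Count how often each person is mentioned in the encoded volume.
   "acta6" and the bare persName markup are illustrative assumptions. :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

for $name in collection("acta6")//tei:persName
group by $key := normalize-space($name)
order by count($name) descending
return <person n="{count($name)}">{$key}</person>
```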

“The Lifecycle of the DTA Base Format (DTABf)”, Susanne Haaf and Alexander Geyken

The context for this presentation on the DTABf is the DTA (Deutsches Textarchiv), funded by the DFG and a partner of CLARIN-D, for which the DTABf serves as the base format. The DTABf is a TEI P5 format used across the entire corpus and a prerequisite for every text to be incorporated into the archive. The core corpus of the DTA contains first editions of German texts from the 17th to 19th centuries. These are digitized in-house and have been encoded by double keying using a reduced tagset. There is also an extension corpus drawn from several other resources and projects that are being integrated into the DTA; 1,014 works have been digitized so far. The historical German texts come from a wide variety of sources in a wide variety of encoding formats, and the DTA aims to make all of them available as a single coherent corpus. The main categories of texts are: 1. texts already following the DTA guidelines, 2. other TEI formats, 3. other non-TEI formats, and 4. OCR-born large collections. All of these are converted into the DTABf for the sake of homogeneity, and high standards of metadata are necessary for this conversion. TEI Lite and TEI Tite were ruled out as possible base formats due to concerns about scope and interoperability. The DTABf has developed out of the texts that were already available in the corpus, and its documentation reflects the experience of encoding these varied texts from such a large time span. The DTABf has been continuously adapted to new phenomena, and consistent encoding guidelines are applied to all texts. It has a tagset of 80 elements, 25 teiHeader elements, a restricted selection of attributes, and closed sets of values. The components of the DTABf are an ODD, a RelaxNG schema, documentation, and guidelines for text transcription and encoding (a rough sketch of such a customization is given below). Wikisource syntax has also already been transformed into the DTABf. New challenges come up regularly: new text forms, such as funeral sermons, pose interesting tests for the format. The DTABf provides four different levels of encoding to allow for various depths of markup and analysis. It is supported by a DTA oXygen framework for editing and proofreading, a web-based editing facility, a web form for metadata creation, and serialization routines that prepare the object data for use in NLP tools. Workshops are offered for training, and presentations are given regularly. The DTA corpus has also been integrated into the CLARIN-D service centre, and conversion routines have been developed to map the DTABf to the format required by CLARIN-D.
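The real DTABf ODD is far more extensive, but a minimal sketch of the kind of restriction such a customization applies (a small module/element selection plus a closed attribute value list) might look as follows. The chosen modules, elements, and values are illustrative assumptions, not the actual DTABf customization.

```xml
<!-- Hypothetical ODD sketch of a reduced tagset with closed value lists -->
<schemaSpec ident="DTABf_sketch" start="TEI">
  <!-- Pull in only the modules/elements actually needed -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core" include="p head hi note lb pb foreign quote"/>
  <moduleRef key="textstructure" include="TEI text front body back div"/>
  <!-- Restrict an attribute to a fixed set of values -->
  <elementSpec ident="div" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="chapter"/>
          <valItem ident="preface"/>
          <valItem ident="contents"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```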