Innsbruck Computer Archive of Machine-Readable English Texts

ICAMET

(Innsbruck Computer Archive of Machine-Readable English Texts)

Manual of the Innsbruck Middle English Prose Corpus

Version 2.4

Manfred Markus

English Department, University of Innsbruck, Austria

Innsbruck, 2010

Preface 2010

For a table of contents of this file please open the menu card LAYOUT and then DOCUMENT STRUCTURE in your WORD program. You will then see the chapters and subchapters of this manual in a separate window.

The following manual is an abridged and substantially revised version of the published booklet of 1999 (Manfred Markus, Manual of ICAMET. Innsbrucker Beiträge zur Kulturwissenschaft. Anglistische Reihe 7. Wien: Braumüller Verlag). The new version refers to the Innsbruck Prose Corpus only; the Innsbruck Letter Corpus will be published separately elsewhere.

The sampler provided on the distributable CD ROM comprises only 139 text files of the complete corpus, both in a doc-version without “cocoa” headers and in an rtf-version with the headers. The latter version has been added for users who do not presently work with Windows XP, as well as for users of Raymond Hickey’s program Corpus Presenter (see below), which accepts only rtf-files for analysis.

The Innsbruck Prose Corpus as a whole now consists of 159 files, all of which, due to copyright restrictions on some of the texts, are accessible only on a CD ROM at the English Department of the University of Innsbruck. Chapter 5.2. below provides information on which files of the total corpus could not be included in the present sampler.

The earlier sampler of Middle English prose published on the 2nd edition of the ICAME CD ROM in 1999 was in need of revision for two main reasons. First, the number of text files was considerably smaller than in the subsequent versions, which have profited from the recent permission of the Early English Text Society to include in the Innsbruck Prose Corpus 23 books published by the EETS and still under copyright protection. The exact titles are identified in the survey table below (5.2.). I am most obliged to the EETS for this act of benevolence. The other reason why the first version of the Innsbruck Prose Corpus had to be revised is the fact that some of the special characters used in the original DOS files came out modified into “hieroglyphs” in the WinWord version published by the HIT Centre in Bergen in 1999. I apologise for this defect and am now pleased to announce that the problem of the fonts has been solved in the present version. As regards the copyright barrier, I am still hopeful that Oxford University Press will allow the "fair academic use" and distribution of the digitised files of EETS books still under copyright protection.

As in the past, the Innsbruck Middle English Prose Corpus is here offered to the international community of researchers for scholarly purposes on a non-profit basis. Whoever uses the corpus as a whole or in parts for publications is kindly asked to send me a message of information about this fact.

The rtf-versions of the present sampler are furnished with COCOA headers. These headers allow analysts to filter out files according to any of the 26 parameters offered for this purpose. For the implementation of the headers in the two Innsbruck corpora I am much obliged to Raymond Hickey, whose program Corpus Presenter was published in book form (cf. Hickey 2003). I would also like to thank Hans-Jürgen Diller, who has been one of the most critical users of the corpus over the last few years, referring me to mistakes and incongruities, particularly in the headers. As far as these are concerned, I am also much obliged to one of my postgraduate students in Innsbruck, Andrea Leonie Krapf, for helping me to proofread them. Owing to lack of time and money, the headers are anything but perfect and complete.

While there have been experiments with normalising some of the texts of the Innsbruck Prose Corpus, we have not been able to do this systematically for a larger number of texts. This task, which in principle can be solved with the help of recent programs, such as Corpus Presenter (Hickey 2003), has to be left to individual users’ own initiative.

Manfred Markus, University of Innsbruck, Dept. of English, January 2010

Preface (05/1997)

ICAMET, the Innsbruck Computer Archive of Machine-Readable English Texts, has three parts:

(1) the Innsbruck Prose Corpus (1100 to 1500)

(2) the Innsbruck Letter Corpus (1386 to 1688)

(3) the Innsbruck Varia Corpus (still in preparation)

The three sub-corpora are fairly unequal. The Prose Corpus consists of 159 full-text data bases, usually complete books, of nearly 6 mill. words altogether.

The Letter Corpus, considerably smaller, contains 254 letters of a total of 110,307 words (2006: 337 letters of a total of 146,183 words). The Varia Corpus is a potpourri of a dozen or so translated, normalised, tagged or alternative versions of the texts, particularly of those in the Prose Corpus; also, some texts ended up in the varia section, because they belatedly turned out to be in verse or post-1500.

This manual is only concerned with the PROSE CORPUS (ICAMET proper, so to speak).

ICAMET was supported by the Austrian "Forschungsfonds" in its initial stage, namely for two years from 1992 to 1994. While I am most obliged for this support, the reason why the project could not thrive as planned has to do with the reduced amount of funding from the very beginning and the abrupt cancellation in the late summer of 1994. The original plan for the Innsbruck full-text data base of prose was to compile considerably more files than can now be presented to the international community, and also more representative ones. Moreover, we were confident of getting full copyright permission from the EETS for fair academic use.

As things have developed, the Innsbruck Prose Corpus is as yet incomplete and more fragmentary than intended. Some text types are not represented at all or only insufficiently, as are the 12th and 13th centuries compared with the 14th and 15th, and the copyright question, concerning about 2/3 of our texts, has not yet been fully solved and has caused the unpleasant delay in the publishing of this corpus. Unlike many commercial distributors of CD-ROMs, we have not deliberately selected old editions only to bypass the copyright problem. As a result, for the time being some of our texts are not freely available, whether on CD-ROM or on the Internet. But all texts of the Innsbruck Prose Corpus, including those that are still under copyright protection, can be used by researchers in Innsbruck itself. The regularly updated details concerning the availability of the Innsbruck Prose Corpus can be found on our Internet page (address: www2.uibk.ac.at/fakultaeten/c6/c609/projects/icamet).

Innsbruck, May 1997

Manfred Markus

Acknowledgements

While we are still hoping for access to all the books of the Early English Text Society included in the Innsbruck Prose Corpus, we highly appreciate the permission of the Council of EETS to use a subset of the EETS volumes still under copyright protection, namely 23 (see the table under 5.2.); for this I am particularly obliged to Richard Hamer of the EETS. Moreover, the distribution of our corpus texts was graciously licensed by other publishers, in particular, the Universitätsverlag C. Winter, Heidelberg, for their series Middle English Texts (General Editors: Manfred Görlach and O.S. Pickering), and James Hogg for some volumes of the Salzburg Series. I would also like to thank the following publishers for granting permission to include single texts in our corpus: Almqvist & Wiksell's Boktryckeri, Upsala; Eynar Munsgaard Publishers, Copenhagen; Milford Publishers, Oxford & London; Garland Publishing, New York & London; Martinus Nijhoff, The Hague; The University of Leeds; Oxford University Press; and Cambridge University Press.

I am also very much obliged to many helping hands, without which, needless to say, the project would not have materialised. Some of them were major members of the project, but most of them cooperated on a short-term basis (so-called "trainees" and "tutors"), and they have been too many over the years to be all mentioned individually. My particular thanks, however, go to Roland Benedikter, Andrea Kruckenhauser, Robert McColl Millar, Ulrike Mühlbacher-Nadenik, Paul Perger, Eva Maria Rainer, Elliot Schreiber, Gerda Schütz, Maturot Sinavarat, Aloys Wechselberger, and Claudia Herzog.

I would also like to thank both the Department of English and the Faculty of Letters of the University of Innsbruck for their infrastructural support and the hardware needed in a project like the Innsbruck Middle English Prose Corpus. Moreover, I am very much obliged to the "EDV-Zentrum" and the "Subzentrum" of the University, in particular the late Georg Anker, for software and know-how support.

Of the various other people that were of help "warewise" (i.e. in matters of hard- or software), I would like to mention Mario Andriollo and Josef Wallmannsberger, both temporary Innsbruck colleagues of the English Department. Moreover, I am aware of having profited a great deal from several student participants in my classes on computer philology, taught in Innsbruck over the last twenty years; in this dynamic academic field, where teachers can particularly learn from their students, it is not a mere gesture that I express my thanks to them.

As researchers in residence, some of the users of the full version of the Innsbruck Prose Corpus, in particular Hans-Jürgen Diller, have obliged me by giving us feedback concerning mistakes in the corpus; we are most thankful for this kind of cooperation.

Finally, I have often felt inspired and encouraged during my participation in various conferences of the last few years, among these a series of ICAME conferences and, as far as historical linguistics is concerned, several conferences initiated or even convened by Matti Rissanen of the University of Helsinki; so not the least of my thanks go there and to the ICAME organizers. It also has to be acknowledged that The Helsinki Corpus of English Texts, initiated by Matti Rissanen and Merja Kytö, was the first of its kind, thus paving the way for many other historical English electronic corpora to follow.

0. Introduction: organisational

The method of compiling this corpus was basically the following: the selected text was first scanned, either directly from the book or, in some cases of bad quality of pages (e.g. uneven patches), from a xerox copy made for the purpose. The scanner was a Siemens Highscan 400 machine; the programme for font recognition was, after an initial phase (1991) of experimenting with OCR Recognita and Optopus, PROLECTOR. It is not so easy to handle, but flexible as to the special needs in view of "badly" printed old books[1].

The scanned texts were all manually corrected by two people and given a normalised format/layout. Correction always meant reading against the original edition at least once; shorter texts up to 100 pages were read independently twice. With longer texts I functioned as the second corrector by at least checking the degree of reliability of the first correction. The work of correction partly included applying global commands and WORDCRUNCHER index lists (cf. Markus 1994). While all contributors to the project gave their best, one hundred percent reliability could not be reached. Almost all the members of the changing team were not specialists in Middle English orthography. But then even well-made present-day books are not a hundred percent perfect. All in all, the Innsbruck Middle English Prose Corpus and the Innsbruck Letter Corpus, far from being a mechanical reproduction of edited texts, are reliable enough to be used as bases for scholarly work. Benevolent users are invited to report any mistakes found in the corpus texts to my e-mail address: .

1. Principles of compiling the PROSE CORPUS

The Innsbruck Prose Corpus is a full-text data base, aiming at target groups of users who, unlike those of the Helsinki Corpus of English Texts (cf. Kytö 1993), are not only interested in extracts of texts, but in their complete versions. The corpus thus allows literary, historical and topical analyses of various kinds, particularly studies of cultural history, but it also invites linguists to raise questions, e.g. of style or rhetoric, for which one would want a lengthier piece of text, or its beginning and ending.

The corpus is a selection of Middle English prose. The texts are, therefore, relatively free from poetic stylisation (in spite of the occasional role of the alliterative and formulaic tradition). The language of this prose can be assumed to be closer than that of poetry (if not close) to the way language was really used in speech. It also represents the many varieties, both in speaking and spelling, of Middle English as used in special text types and for special occasions.

1.1. Overall structure

The corpus comprises 159 files, representing 131 texts as found in scholarly editions, largely those published by the Early English Text Society (EETS). In line with editing habits, texts normally have book length, and they are accordingly identified in our corpus by an individual number. Only in some exceptional cases, as with Caxton's collected prologues ("caxtpro"), have we given one ID-number to a collection of (usually short) texts. On the other hand, many-volume texts (like Pecock's The Donet) which were published separately under different names are stored volume-wise in the corpus and have accordingly been given different names and ID-numbers.

Text versions based on different manuscripts and presented synoptically in some editions have been given one number only, which is, however, specified by additional letters a, b, c, etc.

1.2. File names

All the files have names within the 8-letter DOS mode. If the author is known, the first letters of the name of a file usually suggest the author's name before the title of the work, thus myrcseve means "John Myrc, Seven Questions...". In the case of works by Chaucer, however, the question of the manuscript used for an edition has come in. Here the file name usually includes reference to a manuscript or edition. So persske is the name for Chaucer's Parson's Tale in the edition by Skeat. The more common case is that of anonymous works. Here the file's name suggests either its title only or the title plus the manuscript; thus, ancrenero means "Ancrene Riwle, MS Nero". In the rare cases when author, title of work and manuscript have to be named, the manuscripts are merely suggested by the last letter (a, b, c, etc.). By the same token, texts which were too long for saving on one diskette were split into different parts identified by final running numbers after the proper name. Thus, trevdia1 means "Trevisa, Dialogus inter militem et clericum, Part 1".
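The convention of final running numbers for split texts can be illustrated with a minimal Python sketch. The function name parse_file_name is hypothetical and not part of the corpus or of any accompanying software; the sketch only assumes, as described above, that trailing digits in a file name identify the part of a text that was split across diskettes.

```python
from typing import Optional, Tuple


def parse_file_name(name: str) -> Tuple[str, Optional[int]]:
    """Split a corpus file name into (stem, part number).

    Assumption: any digits at the very end of the name are the running
    number of a split text; names without trailing digits are unsplit.
    """
    # Strip trailing digits to obtain the stem of the name.
    stem = name.rstrip("0123456789")
    digits = name[len(stem):]
    part = int(digits) if digits else None
    return stem, part


if __name__ == "__main__":
    for example in ("trevdia1", "myrcseve", "persske"):
        print(example, parse_file_name(example))
```

Grouping file names by the returned stem would allow the parts of a split text to be read back in sequence. The final manuscript letters (a, b, c, etc.) are deliberately not parsed here, since a name may end in a letter for other reasons.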

The files of the corpus, arranged alphabetically and according to other principles, are listed in chapter 5 below.

1.3. The compiler's dilemma: authenticity vs. retrievability

On the one hand, the specific characteristics of a manuscript as reflected in the used edition are to be presented on the screen as authentically as possible, with all the deficiencies or alleged deficiencies of the manuscript. On the other hand, medieval scribes and editors of medieval texts reveal different policies of encoding, so that the output is anything but consistent. In order to make things retrievable for the computer, some of the practices of both scribes and editors have to be emended and many of the coincidental characteristics of fonts, format and layout have to be given up. This may be illustrated by two examples.

Medieval scribes, and partly the editors with them, did not always pay attention to the unity of words, for example, by cutting them to pieces from one line to the next without hyphenation. While it is not entirely clear whether strings such as there fore are to be considered as one word or two in medieval texts, we have linked obvious constituents of words, marking the intervention by an underscore hyphen in parentheses (_). We have used an underscore hyphen without parentheses in those cases where a word has been split by the editor through the breaking of line and has been syllabified. We were thinking of lexical analyses and crunching programs, which, without our intervention, would index words like house/hus and bond where the text really has or intends husbond. On the other hand we did not want to give up the line-breaking of editors altogether. Keeping the lines of the editions not only made our task of proofreading much easier, but will also allow future users easily to check texts against the editions used in the given cases. Moreover, sticking to frozen lines (preferably those of the editions) will allow a later semi-automatic production of normalised interlinear lines. While these are not (yet) provided in the present version of the Prose Corpus, I have occasionally reflected on the possibility and necessity of such additional normalised lines for Middle English prose texts (cf., for example, Markus 1997).
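For users who want to index whole words, the two markers can be resolved before a text is passed to a concordance program. The following Python sketch is an illustration only: the function name join_split_words is hypothetical, and the exact placement of the markers is an assumption, namely that (_) stands between the linked constituents (possibly followed by a line break) and that the plain underscore stands at the very end of a line.

```python
import re


def join_split_words(text: str) -> str:
    """Rejoin words divided by the corpus markers, for indexing purposes.

    Assumed conventions (not confirmed by the manual):
    - "(_)" links two constituents of a word and may be surrounded by
      blanks or followed by a line break;
    - "_" at the end of a line marks an editorial line-break division.
    """
    # Remove the compilers' link marker together with surrounding whitespace.
    text = re.sub(r"\s*\(_\)\s*", "", text)
    # Remove an underscore at the end of a line and the line break after it.
    text = re.sub(r"_[ \t]*\n[ \t]*", "", text)
    return text


if __name__ == "__main__":
    print(join_split_words("there(_)fore"))  # linked constituents
    print(join_split_words("hus_\nbond"))    # editorial line break
```

Since joining removes the line break after a divided word, the result no longer matches the lineation of the edition; it is best kept as a separate working copy for indexing, with the original file retained for checking against the printed text.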

A second example of unavoidable emendation is the inconsistent use and representation of initials in manuscripts and editions. While ignoring them would be a real loss in view of their possible function as markers of the beginning of chapters, the editors' various ways of marking, or commenting on, different types and sizes of initials had to be given up for the sake of their being retrievable. If we had reproduced a dozen or so different ways of marking initials, how could the user have been in a position to find them? So we have used just one coded marker for initials, namely <b> (for "boldface") (see below 4.1.).
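Because there is only one marker, initials can be retrieved with a single search pattern. The short Python sketch below assumes that the marker <b> stands immediately before the initial letter; the details of the coding are given in chapter 4.1. and may differ from this assumption. The function name find_initials is hypothetical.

```python
import re

# Assumption: "<b>" directly precedes the initial it marks.
INITIAL_MARKER = re.compile(r"<b>\s*(\w)")


def find_initials(lines):
    """Return (line number, character) pairs for every marked initial."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for match in INITIAL_MARKER.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = ["<b>In the bigynnyng", "was the word", "<b>Thanne he seide"]
    print(find_initials(sample))
```

The line numbers returned could serve as candidate chapter beginnings, which would then have to be checked against the edition.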