František Čermák

Institute of the Czech National Corpus, Charles University, Prague

KGA 03 (2011), 33–44

Some Current Problems of Corpus and Computational Linguistics, or Fifteen Commandments and General Truths

A lot is going on in the field, and research here seems to be almost without limits. Much of this is due to the vast possibilities offered by computers, corpora, the Web, etc.; much, however, is due to a disregard, evident in many, for what real linguistics has to say and offer, whether this may be a good sign of bridging the gap or not. Recent corpus-based grammars and dictionaries may certainly be viewed as a good sign.

Likewise, there is no doubt that a number of insights, techniques and useful results of real interest and inspiration to linguists have been achieved in the field over the years, machine translation being just one example (a field which in itself took decades to get to its present and limited state, while linguists have always been behind it). Due to its different, more linguistic orientation, it is also evident that corpus linguistics is no longer a mere branch of computational linguistics, although there is a lot of overlap and mutual inspiration. Praise along these lines is not difficult to formulate, but are we really after praise and a feeling of self-satisfaction only?

However, despite the considerable progress recorded here, some real problems of importance persist and new ones have emerged. In what follows, some of these are summarized and presented in a special and, perhaps, idiosyncratic list of problems which I choose to call commandments because of their urgent nature. In an attempt to sum up, in a terse and perhaps somewhat aphoristic form, some of the most salient aspects characteristic of the present state of affairs, and drawing on the sobering and often disconcerting experience of recent years, they are offered here in a nutshell, followed by comments. Technically, this contribution is a build-up and modification of a similar but much smaller list and survey published five years ago (Čermák 2002).

(1) Garbage in, garbage out.

A Comment: This is an old, general and, hopefully, broadly accepted piece of experience and re-interpreted knowledge which has, in the case of corpora, a number of implications related to data and their treatment. One cannot expect reasonable and useful results from bad data fed into the computer, whether because of their skewed nature, because they are not representative of the phenomenon to be described, or simply because they are not sufficient. A special case of the latter is a wanton choice of data, artificially cut off from other data although these are interconnected. This leads to another point. To make things worse, once a clumsy query or a badly formulated algorithm is used, the results one gets may be even more useless.

(2) The more data the better. But there is never enough data to help solve everything.

A Comment: This points to a need for more, and also alternative, resources, such as spoken data. However, one has to distinguish between the data proper and their necessary contexts. Some language phenomena, even words, are quite rare and require a lot of recorded occurrences if a decent description is to be achieved, while on the other hand some data are to be found only in special types of texts, including real spoken data. However, once you have, perhaps, enough data, you are faced with a rather different challenge. You will never get at the meaning hidden behind the form directly, its elusiveness being often a source of vexation. To get help, you will have to resort to larger contexts, often extralinguistic in nature, and use your power of introspection and your knowledge of the rules of ellipsis, etc. The worst type of this is found in spontaneous spoken data, which are not, as a rule, provided with external contexts. Hence, the call for maximum data is also an expression of hope to find help there.

A Corollary: A popular blunder committed over and over again is to believe that once we are dealing with written texts (which are so easy to come by) we are dealing with the whole language.

For a number of reasons, historical and informational, the truest form of language is the spoken (spontaneous) language. Here, true linguistics, including corpus linguistics, has barely started, and the lack of this kind of data is appalling.

(3) The best and really true information comes from direct data.

A Comment: There is always the alternative of pure, unadulterated texts, forms devoid of any annotation, which many people are extremely afraid of, pretending these are not sufficient for their goals while, in fact, they do not know how best to handle them. The other alternative, annotation, now offered as the primary solution, is, to a varying extent, always biased and adulterates both the data input and the results obtained. Hence, annotated texts should always be viewed as an alternative only, although a fast and popular one. Fast may often mean superficial, simplified or distorted, however.

A Corollary: Don't sneer at plain text corpora, since many kinds of information are to be got only from them. Linguistically, tagged texts are always a problem and a great misleader. Much textual information is lost through tagging and lemmatisation; moreover, certain semantic, collocational and other aspects, being found with word forms only, do not obtain with their lemmas. Computational linguists seem to be obsessed with ever improving their tagging and lemmatization scores but forget to measure the degree of information loss this causes. Perhaps they do not know how to do it, provided they recognize it as a problem at all. Any markup is additional information only, and there might be as many markups of the same corpus as there are people available and willing to do them. The result, should this be undertaken, would be an array of rather different views, while the language behind them remains the same.

Another Corollary: Data are sacrosanct and should not be tampered with, even with the best of intentions. Yet some computer people find odd text mistakes, seeming misspellings, etc. so tempting that they are inclined to "improve" the bad and naughty data to fit their nice little programmes better. There is always a reason for this quality of the data, most often because we are dealing with natural language, which is imperfect by definition. In fact, we will never know for certain how much tampering the plain data have gone through once they left the author, because of over-diligent, idiosyncratic editors, censors, programme failures, etc.

One of Murphy's laws puts this rather succinctly: In nature, nothing is ever right. Therefore, if everything is going right ... something is wrong. Naive and at the same time dangerous people endorsing this view may look upon the first part of the law as a justification for what they are about to do, while it is the second part that a real linguist should take seriously, both as a principle and as a warning.

To widen the gap between plain and annotated data even more, one should remind oneself, as many corpus lexicographers have done for some time now, that much of the sense and meaning of words is associated with specific word forms or their combinations only and cannot be generalized to lemmas.

(4) Representativeness of a corpus is possible and, in fact, advisable, being always related to a goal.

A Comment: This has largely been misinterpreted to mean that a (general) representative corpus is nonsense. Well, it is not. There are in fact as many kinds of representativeness as there are goals we choose to pursue. What one does not usually emphasize is that representativeness vis-à-vis a general multi-purpose monolingual dictionary is a legitimate goal, too. Otherwise corpus lexicographers would be out of business. The whole business of lexicography is oriented towards users, offering them a fair coverage of what is to be found in texts and, accordingly (hopefully), in dictionaries. Coverage of a language in its totality requires a lot of thinking and research about what kinds of texts must be used for corpus-based lexicography and in what proportions. But there are other kinds of representativeness, of course, perhaps better known for their more clear-cut goals.

In a sense, it is thanks to modern large corpora that one may, though still only to a degree, envisage and perhaps even try to achieve some sort of exhaustive coverage of the language features being studied. Superficially, and to a lay linguist, that would seem only natural. Yet many approaches, being content with partial, limited and therefore problematic results, unfortunately seem to aim at the very opposite.

(5) There is no all-embracing algorithm that is universal in its field and transferable to all other languages.

To believe this, one has to be particularly narrow-minded, a fact which is, understandably, never admitted. Yet some do sometimes act as if it were true.

A Corollary: Language is both regular and irregular, and there is no comfortable and handy set of algorithms to describe it as a whole.

In fact, since language is not directly formalizable as a whole, techniques must be developed to handle the rest that is not. The linguist's task is to delimit the necessary line between the two and to decide on appropriate approaches. Blind and patently partial trials on a small scale to prove one's cherished little programmes and algorithms are mostly useless and a waste of time. Although statistical coverage is often a great help, it is generally too rough, too, and statistical results as such cannot stand alone, without further filtering and interpretation.

Another Corollary: Formalisation must not be mistaken for statistics.

Next to differences between cognate languages, there are major and more serious typological differences. For instance, no one has so far been able to handle the complex agglutinative text constructions of little-known polysynthetic languages, where the boundaries between morphemes, words and sentences collapse (overlapping being too mild a label to use here).

Another Corollary: There is no ontology or, rather, thesaurus design possible that would be freely interchangeable between all sorts of languages.

This rather popular approach of ontologising as much as possible, often drawing on developments of WordNet, a move both to be applauded and to be criticised on principle, is both antilinguistic, in that it openly ignores the several-centuries-old tradition of thesauri, and ill-founded, in that it is often a medley of impressions and some knowledge adopted from existing dictionaries or even from English. What these ontologies are really representative of is very much an open question. Admittedly, some might be useful, despite their drawbacks.

(6) Tagging and lemmatisation of corpora, however useful and despite their obvious drawbacks, will always lag behind the needs and size of corpora.

A Comment: The steady growth of corpora and the requirements brought forth by the constant flow of new forms and words in new texts will always lead to imperfect performance of lemmatizers and taggers. An obvious problem that will not go away is the ever new variants coming in as alternatives, as well as imported foreign words and proper names.

A well-known fact, hardly ever mentioned, let alone subjected to a serious analysis aiming at a solution, is that of hapaxes, which do not seem to go away. Since their number is so high (up to 50 per cent of word forms, even in corpora with hundreds of millions of words), it follows that, in order to make up for the information loss there, corpora should be much larger. It also means that a, say, 100-million-word corpus represents only a fragment of the reality and, partly, of the language potentiality that is used as a basis for search and research. In a way, hapaxes, following Zipf's law, are inevitable, and their number, usually not published, does say a lot about how language is treated automatically. Hence,

A Corollary: It seems that the smaller the number of hapaxes, the more suspicious the treatment.
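For concreteness, here is a minimal sketch of how the share of hapaxes among word-form types might be checked on a plain, untagged corpus; the file name is only a placeholder, and the whitespace tokenization is an assumption made for the illustration.

```python
from collections import Counter

def hapax_share(tokens):
    """Proportion of word-form types that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)

# Illustrative use on a whitespace-tokenized plain-text corpus
# ("corpus.txt" is a placeholder, not a real resource):
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()
print(f"hapax share among types: {hapax_share(tokens):.1%}")
```

Publishing such a figure alongside tagging and lemmatization scores would cost nothing and would say a great deal about how the data have been treated.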

Unlike English, Chinese, Vietnamese, etc., most languages, having some kind of inflection and offering a lot of homonymous endings, seem to be in constant change.

Perhaps the largest and, in fact, enormous field, one which has not been covered in full in any language, is that of multi-word units and constructions, including idioms (phrasemes) and terms. No language has covered these to any degree of satisfaction so far. It is obvious that these are systemic units (lexemes) and that it is dangerous to wave them away as mere collocations of specific word forms. The truth is that we often do not know where to draw the line between normal open-class collocations and those that are anomalous. Moreover, some words do not exist in isolation, and one may reasonably view them only as parts of larger units, mostly idioms. This leads to a categorical demand that more types of language forms be recognized in a linguistic corpus analysis than just one or two.

Taking all of this into account, tagging and lemmatisation appear to be a formidable task to grapple with. Hence, saying that a corpus is tagged and lemmatized is a rather immodest claim.
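As to where the statistical treatment of collocations usually begins, the sketch below, assuming nothing more than a list of word-form tokens, ranks adjacent pairs by pointwise mutual information. The score flags strong combinations, but it cannot by itself tell systemic multi-word units (idioms, terms) from ordinary free combinations, which is precisely the line mentioned above.

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, min_count=5):
    """Rank adjacent word-form pairs by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs give unreliable PMI values
        p_pair = count / (n - 1)
        p_w1 = unigrams[w1] / n
        p_w2 = unigrams[w2] / n
        scores[(w1, w2)] = log2(p_pair / (p_w1 * p_w2))
    # highest-scoring pairs first; deciding which of them are lexemes
    # still requires linguistic interpretation
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```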

(7) Don't get fascinated by your own system of coding; it may not be the only one, nor the best one.

A Comment: There are so many authors with such brilliant theories (clashing, however, with each other) that one should be wary. Moreover, the theories used in applications do not always come from the best theoreticians available. Rather, they may be concoctions, often recooked with new ingredients, of a number of partial and ad hoc technical solutions standing at some distance from language reality.

A Corollary: The more interpretation a phenomenon gets in various approaches, the more unstable and suspicious its solution may be. This holds particularly for syntax. But it holds more generally, too.

Another Corollary: Black-and-white and yes-or-no solutions are never foolproof.

Instead, more options offered along a scale would often correspond more closely to the language facts. It seems that, for historical reasons, linguists were originally forced by the machines into this strange black-and-white binary approach almost everywhere. But isn't it high time to reverse the state of things and force machines to view facts as (at least some) linguists do?

(8) Lemmatizers have invented imaginary new worlds, often creating non-existent monsters or suggesting false ones.

A Comment: It is mostly over-generation lurking behind this problem that presents a serious danger criticised by linguists. But this must not be mistaken for language potential and potentiality, which reflect, among other things, the future development of the language in question. Languages never follow forecasts, let alone rule-based ones.

Obviously, over-generation must be avoided, as it creates fictitious worlds which are hardly ever part of what is possible and potential. Over-generation, depending on the language in question, is perhaps best known from languages with rich inflection, where algorithms suggest, for instance, a fictitious lemma for a form, or label a form as part of an inflectional paradigm that does not exist in reality.
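A toy sketch of the mechanism may help; the suffix rules and the Czech example are invented here purely for illustration and do not describe any actual lemmatizer. A guesser that proposes every lemma its rules allow will, without a lexicon behind it, readily propose lemmas that exist in no dictionary.

```python
# Hypothetical suffix-stripping rules: (ending of the word form, lemma ending).
# They are made up for this sketch and mimic a rule-based lemma guesser
# for a richly inflecting language.
SUFFIX_RULES = [
    ("ami", "a"),  # instrumental plural of a-stems -> lemma in -a
    ("ami", "o"),  # the same ending would also fit neuter o-stems
    ("i", "a"),    # an over-eager rule that fires on far too many forms
]

def guess_lemmas(form):
    """Return every lemma candidate the rules can produce, real or not."""
    candidates = set()
    for ending, lemma_ending in SUFFIX_RULES:
        if form.endswith(ending):
            candidates.add(form[: -len(ending)] + lemma_ending)
    return candidates

# For the Czech form "ženami" (instr. pl. of "žena", woman), the guesser
# proposes the correct lemma alongside forms that are no lemmas at all:
print(guess_lemmas("ženami"))  # {'žena', 'ženo', 'ženama'}
```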

A Corollary: On the other hand, language potentiality is not identical with wild, boundless over-generation.

The prevailing binary yes-or-no approach basically prevents any change in this. A headache, if perceived by computational linguists at all, is the frequent and natural variability of forms, which is closely related to this. Much of this holds for morphological tagging, too.

(9) It is not all research that glitters statistically.

A Comment: Statistics can be a great help as well as a great misleader, e.g. in the business of stochastic tagging. At best, any numeric results are only a kind of indirect and mediated picture of the state of language affairs, often based on training data that are too small and haphazard, much of which linguists still do not know how to translate into linguistic terms. The tacit assumption that a tagger trained on a million forms will really grapple with a corpus of a hundred million words is false.

A Corollary: There is no such thing as a converse of reductionism.

That in itself may often be a help, though perhaps elsewhere in language. This, in turn, is related to the sample and its type. Linguists should not be intimidated by the formalisms by which a sample is supposed to be arrived at and insisted upon, if pure statistics has its say. Statistical formulas are just too general and insensitive to many factors, including the type of text, the frequency and dependence of the phenomenon pursued, etc. This also holds for corpora based entirely on samples: advocating these is often asking for trouble, as some texts make sense only as a whole.

Another Corollary: There is no universal formula to calculate your sample easily. Before accepting one, use your common sense.
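For concreteness, the kind of universal recipe meant here is the textbook estimate of sample size for a proportion (standard statistics, added here only as an illustration), which presupposes independent observations and knows nothing about the type of text or the dependence of the phenomenon pursued:

```latex
% Textbook sample size for estimating a proportion p with tolerated error e
% at a confidence level given by the normal quantile z:
n = \frac{z^{2}\, p\,(1-p)}{e^{2}}
```

Running text never supplies the independence the formula assumes, which is one more reason to use common sense before accepting the number it yields.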

(10) Language is both regular and irregular; not everything may be captured by algorithms automatically.

A Comment: Apart from irregularities in grammar, this points, yet again, mostly to the very much neglected field of idioms and to the grey zones of metaphoricity. The need for stepwise, alternative approaches along a scale is obvious. Due to its theory-dependence and the lack of a general methodology, no one knows where the border-line between regular and irregular, rule and anomaly, etc. really lies. Obviously, blind formalistic onslaughts on everything in language are gone, but do all people realize where to stop and weigh the situation? Here, statistics is often too tempting for some to be avoided, as it will always show some results, no matter how superficial and useless.