What is a word?

Richard Hudson,

Dept of Phonetics and Linguistics,

University College London,

Gower Street,

London WC1E 6BT

Abstract

Words typically combine properties from different levels, so they provide the simplest point of contact between levels. On the other hand, there are well-known mismatches between the word-like units of syntax and those of morphophonology, and similar mismatches can be found between other word-like units. How should linguistic theory accommodate these mismatches without losing sight of the matches found in more typical cases? And what kinds of words should we recognise? I offer a general theory of word-types which is not tied to any particular theoretical framework (though it uses some notation and concepts from Word Grammar). I also discuss the Lexical Integrity Principle, and suggest a new formulation of the principle which may explain why phonological and morphological structure are `invisible' to syntax. I contrast the details of this analysis with two recent discussions of word-types based on Lexical Functional Grammar.

1. Background

What is a word? Linguists certainly ought to be able to answer this question, as we all mention words in our grammars, and for some of us the word is the most fundamental unit of language. It is easy to argue that the question has no good answer because it has too many answers, all reasonable but all incompatible with each other. For example, how many words are there in "I'm tired"? By some definitions there are two, by others three - but who's to say which of these definitions is the right one? This looks like a good reason for rejecting the notion `word' out of hand as a part of scientific linguistics; and yet the fact is that we all use it, and most of us use it a lot.

Another approach to the question is to focus on typical words. After all, most words (at least in English) are words by any imaginable criterion. Take the last one in the previous sentence: criterion. This is a word for at least the following reasons:

(1) a. It is bounded by a word space and a punctuation mark.

    b. It has a single word-stress, which is fixed lexically.

    c. It corresponds to a lexical item; in lay terms, it would have a separate entry in a dictionary.

    d. It is a noun.

    e. It is directly related by syntactic rules to two other words, any and imaginable.

    f. It has a sense, the notion `criterion' - i.e. the same notion as the French word critère.

    g. It cannot be broken down into smaller units whose meanings combine to give this sense.

    h. The ending -on could be recognised as a marker of singularity, but neither it nor the rest of the word, criteri-, could be used on its own - i.e. it is a `minimal free form' in Bloomfield's terminology.

This example is not isolated; in fact, it is quite typical, and we have to search quite hard for the problem cases. The fact that so many independent characteristics typically coincide in the same unit suggests that these units are real and important. The word lives.

This approach is uncontroversial and correct as far as it goes, but it does not go far enough because we still don't know how to answer our original question: how many words are there in "I'm tired"? Clearly, we need a more sophisticated notion of `word' so that we can distinguish at least two kinds of words. Then we can say that there are two of one kind, and three of the other. This conclusion is also uncontroversial, but the devil is in the detail. What kinds of word are there, and how are they related to one another? The present paper is an attempt to answer these questions.

One of the main roles of the word is as the dividing line between morphology and syntax, with morphology responsible for its internal structure and syntax responsible for its external relations to other words. In the wild days of post-Bloomfieldian structuralism some linguists ignored this division, and present-day transformational syntax is in some respects a direct continuation of this tradition (Matthews 1993:86), but I think everyone would now agree that morphology is to at least some extent different from syntax. A major issue in current research is how clear the division between the two is. Can syntax `see' the internal structure of words? And are word structures similar to syntactic structures? (In principle the two questions are distinct - word structure could be invisible to syntax, but built on the same formal lines.) Linguists are deeply divided on these two questions, with some answering yes to both and others no to both. My own sympathies are with those who answer no, but I shall argue in this paper that the answer has to be somewhat complicated precisely because there are different kinds of word. I shall also suggest (rather tentatively) why word structure should be invisible to syntax.

The immediate stimulus[1] for the paper is the reading of two recent papers written in the framework of Lexical Functional Grammar (LFG): Bresnan and Mchombo 1995 and Mohanan 1995. Both argue for some version of the widely-accepted `Lexical Integrity Principle' (Carstairs-McCarthy 1992:90, Spencer 1991:42, 425, Di Sciullo and Williams 1987, Anderson 1992:22), which tries to define the division between word structure and syntactic structure; and both offer a sophisticated theory of words which recognises words at different levels. I shall disagree with some of the details, but this paper is intended to be not only a contribution to the same debate, but also a contribution in very much the same spirit.

Most of the discussion will be theory-neutral, but the general view of language is that of Word Grammar (Hudson 1984, 1990). One characteristic of this theory is the very controversial claim that language is a special case of general knowledge, so it is worth asking whether the problems that we face in defining the word can be matched outside language. Are there any concepts which are defined by a range of concepts which typically, but not necessarily, coincide? The answer is, of course, yes. Indeed, this is probably the normal situation for any concept, as has been repeatedly argued since the early research of Rosch (1976).

The classic case is the notion `bird', which allows us to acknowledge and explain the typical coincidence of wings, flying, feathers, beaks, two legs, eggs and nest-building, but there are exceptional cases like penguins and ostriches. Admittedly the example of `bird' is only indirectly relevant to the syntagmatic delimitation of words, because it involves the classification of things which are already identified as individuals: an ostrich is undoubtedly a whole bird rather than a part of one or a collection of birds. A better analogy outside language would be the notion `day', which combines a number of distinct characteristics - 24 hours, rising and setting of the sun, a unit of working, sleeping and eating, and so on. The units delimited by these various criteria typically coincide, but need not. As a measure of time we recognise a day from mid-afternoon to mid-afternoon, but this is not relevant either to sunrise and sunset or to work, sleep and food; and if we take sunrise and sunset as the bounds of a day (as we do when we contrast day and night), then the 24-hour period is irrelevant. But it would be perverse to use these possible mismatches as evidence against the notion `day' as a unifying concept, just as it would be to pay undue attention to the special needs of airline passengers and astronauts.

My aim in this paper, therefore, is to offer a theory of words which recognises the fundamental unity of the notions defined by the various levels, while also allowing enough flexibility to accommodate the various mismatches that are found in natural languages (and which can perhaps be explained in functional terms, though that is a different topic which I shall not attempt to tackle here).

2. A preliminary typology of mismatches

We start with a theory-neutral survey of the difficult cases. What kinds of phenomena break the simple coincidence of phonology, syntax and semantics found in dog and sees? Of course the question presupposes that we can identify word-like units at each of these levels, because otherwise we cannot recognise either harmony or disharmony. So that we can talk about these units we could call them `phonological words', `syntactic words' and `semantic words', though I shall offer a more sophisticated classification below. In principle we could also recognise `morphological words' (as recommended for example in Di Sciullo and Williams (1987) and Bresnan and Mchombo (1995)), but in the vast majority of cases these will coincide with phonological words even if not with syntactic or semantic words, so for the present I shall ignore the cases of mismatch (discussed for example in Anderson 1994:18 and Spencer 1991:42), though I shall return to the distinction in a later section. In the meantime I shall use the rather cumbersome name `morphophonological word' to refer without distinction to morphological and phonological words. Another word-like unit is the lexeme, lexical item or `listeme' (Di Sciullo and Williams 1987), i.e. the unit of storage in competence (e.g. the word-form sees is an example of the lexeme SEE). But questions of storage and abstraction are matters of paradigmatic classification, so they are orthogonal to the essentially syntagmatic questions of this paper and I shall ignore them.

In the following discussion, therefore, I shall distinguish just three kinds of word-like units: morphophonological, syntactic and semantic. It will be helpful to have abbreviated names for use in formulae, so I shall call them respectively P, S and C (for `content'). We can now classify known mismatches according to the levels involved. For example, `S+S = P' means `two successive syntactic words corresponding to a single morphophonological word'. Table 1 summarises the cases that we shall consider.

Mismatch pattern    Phenomenon
A. S+S = P          clitics, incorporation, fusion
B. S = P+P          two-word compounds, `phrasal words'
C. C+C = S          ??
D. C = S+S          idioms
E. (S) = P          hesitation forms?
F. S = (P)          zero words??
G. (C) = S          expletives and other semantically empty words
H. C = (S)          arguments left implicit by omission of optional complements
I. S/S = P          words that require dual classification, e.g. participles
J. C/C = S          metaphors, loose reference?

Table 1
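
The formulae in Table 1 can be read mechanically, as the following minimal Python sketch suggests. It is an illustration only, not part of Word Grammar or of any analysis proposed in this paper: the names `Correspondence`, `side` and `pattern` are hypothetical, as is the placeholder sense label `HOT-DOG`. The sketch assumes that each mismatch can be recorded as a list of units on one level set against a list of units on the next level down, with an empty list standing for the bracketed `nothing at all' of patterns E to H and a flag for the conflicting classifications of patterns I and J.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """Units on one level set against units on the next level down.

    All names here are hypothetical; levels follow the paper's letters:
    C (content), S (syntactic), P (morphophonological).
    """
    upper_level: str            # "C" or "S"
    lower_level: str            # "S" or "P"
    upper_units: tuple          # e.g. ("en", "mange")
    lower_units: tuple          # e.g. ("en mange",)
    conflicting: bool = False   # True for dual classification (I, J)


def side(level, units, separator="+"):
    """Render one side of a formula: '(S)', 'S', 'S+S', 'S/S', ..."""
    if not units:
        return f"({level})"     # nothing corresponds on this level
    return separator.join(level for _ in units)


def pattern(c):
    """Return the Table 1 style formula for a correspondence."""
    upper = side(c.upper_level, c.upper_units, "/" if c.conflicting else "+")
    lower = side(c.lower_level, c.lower_units)
    return f"{upper} = {lower}"


# Illustrative cases drawn from the phenomena listed in Table 1.
clitic = Correspondence("S", "P", ("en", "mange"), ("en mange",))
compound = Correspondence("S", "P", ("sixth sense",), ("sixth", "sense"))
idiom = Correspondence("C", "S", ("HOT-DOG",), ("hot", "dog"))
hesitation = Correspondence("S", "P", (), ("er",))
participle = Correspondence("S", "P", ("verb", "adjective"),
                            ("participle",), conflicting=True)

for case in (clitic, compound, idiom, hesitation, participle):
    print(pattern(case))
```

Run as written, the loop prints the formulae for patterns A, B, D, E and I in turn. The sketch deliberately says nothing about how the units themselves are identified, which is the substantive question addressed in the rest of the paper.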

A. S+S = P: Two successive syntactic words corresponding to a single morphophonological word.

In the most familiar cases, the two syntactic words keep their separate morphophonological identities within the larger morphophonological word. The phenomena concerned are cliticization and incorporation, in which a `reduced' word joins a `host' word as part of a larger morphophonological word. It is called cliticization if the reduced word is a semantically empty word (such as a pronoun or illocutionary particle), and incorporation if it is a semantically full word such as a noun or adjective. Here are some standard examples, in which the square brackets surround the relevant morphophonological word and the word-spaces separate syntactic words.

We illustrate cliticization from French:

(2) Paul [en mange] beaucoup.

    Paul of-it eats much. `Paul eats a lot of it.'

The pronoun en is a clitic, with the verb mange as its host. The most obvious and arguably the best syntactic analysis takes en as a dependent of beaucoup, with exactly the same syntactic status as a full prepositional phrase such as de fromage. However as far as phonology is concerned, the best analysis takes en mange as a single unit - clitic pronouns never have a separate word-stress. Moreover, if morphophonological words also provide the domain of morphology, as I assume, the rules for morpheme-order explain the rigidly fixed (and exceptional) position of en before mange. In short, there are good reasons for analysing en as a separate syntactic word, but equally good reasons for analysing it as part of the larger morphophonological word en mange.
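
The bracket convention used in example (2) can be unpacked with a short Python sketch. This is only an illustration under stated assumptions: the function name `parse_bracketed` is hypothetical, and the input is assumed to contain no punctuation and no nested brackets. Word-spaces separate syntactic words, and each pair of square brackets encloses one morphophonological word.

```python
import re


def parse_bracketed(example):
    """Split a bracketed example into syntactic and morphophonological words.

    Hypothetical helper: assumes unnested square brackets and no punctuation.
    Returns a pair (syntactic_words, morphophonological_words).
    """
    syntactic = []
    morphophonological = []
    # Match either a whole bracketed group or a bare run of non-space text.
    for match in re.finditer(r"\[([^\]]*)\]|(\S+)", example):
        bracketed, bare = match.groups()
        if bracketed is not None:
            words = bracketed.split()
            syntactic.extend(words)                       # several S words
            morphophonological.append(" ".join(words))    # one P word
        else:
            syntactic.append(bare)                        # one S word
            morphophonological.append(bare)               # and one P word
    return syntactic, morphophonological


s_words, p_words = parse_bracketed("Paul [en mange] beaucoup")
print(len(s_words), s_words)  # 4 ['Paul', 'en', 'mange', 'beaucoup']
print(len(p_words), p_words)  # 3 ['Paul', 'en mange', 'beaucoup']
```

The difference between the two counts is exactly the S+S = P mismatch: four syntactic words but only three morphophonological words.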

For incorporation we turn to West-Greenlandic Eskimo (Sadock 1987:287):

(3) Hansi angisuu-mik qimme-qar-poq

    Hans(abs) big-inst dog-have-indic/3s. `Hans has a big dog.'

In this example the `reduced word' is qimme, `dog'. According to Sadock, the best syntactic analysis would recognise this as a separate word, with angisuu-mik as a modifier agreeing with it in case (although qimme itself does not carry a case morpheme). In contrast, there are strong phonological and morphological reasons for not recognising qimme as a separate morphophonological word.

In both cliticization and incorporation, the reduced word is morphophonologically distinct from its host, although both form part of a larger morphophonological word. Alongside these cases we can recognise others where there is no clear boundary in the phonology between the two. We can call this pattern fusion, though it could be simply a case of extreme reduction of clitics or incorporated words. It can easily be illustrated from English, where some cliticized verbs can also be fused with their host into a single indivisible morphophonological word. For example, you are can be reduced (by cliticization) to you're, with exactly the same pronunciation as your. Worse still, in a non-rhotic accent it has the same pronunciation as yaw: the verb and pronoun have fused into a single CV structure whose V belongs equally to both syntactic words, and the second word has no separate phonology at all. Similarly, in French the sequence de le, `of the', is fused as du, though in this case the fusion is obligatory. One might even argue that English today is a fusion of this day, explaining the gap in the series this morning, this afternoon, this evening, *this day (Rosta 1996, in preparation).

The essential point about all these three subtypes of our first mismatch-category is that they involve a single morphophonological word which corresponds to (at least) two distinct syntactic words: S+S = P. The differences among the subtypes are less important, and may turn out to be unreal, or matters of degree. The difference between clitics and incorporation may be as blurred as the corresponding difference between function words and lexical words; and as we saw in the English example of you're, cliticization may overlap with fusion. The main challenge for theories of language structure is to allow enough flexibility in the mapping from syntax to phonology, without however treating these mismatched cases as the normal pattern.

Having established the reality of mismatches, we shall move more quickly through the remaining mismatch patterns.

B. S = P+P: One syntactic word corresponding to two morphophonological words. This pattern may be illustrated by some kinds of English compound, such as steely-eyed or head-over-heels, with two word-stresses, or sixth sense, whose first element ends in a consonant cluster /ksθ/ which is normally found only at the end of a word. Presumably so-called `phrasal words' such as good for nothing and French trompe-l'oeil, `illusion', belong in this category (Spencer 1991:426).

C. C+C = S: Two semantic units corresponding to a single syntactic word. In our present state of ignorance about lexical semantic structure it's impossible to decide whether this pattern is possible. [22 October 1999: inflectional morphology: Dog + Plural = dogs]

D. C = S+S: One semantic unit corresponding to two (or more) syntactic words. This is a standard definition of an idiom, such as hot dog or kick the bucket.

The above examples involve many-to-one mismatches. The next four cases have a word on one level which does not correspond to anything at all on the next level.

E. (S) = P: A morphophonological word which does not correspond to anything at all in the syntax. Maybe this description fits hesitation forms such as English er and um, contrasting with Scottish English eeh, French eu and so on - words which are clearly learned and part of one's linguistic competence, and which fit the general morphophonological patterns of the language concerned, but which have no status at all in syntactic structure.

F. S = (P): Morphophonologically zero words. In a surface-oriented approach to grammar there have to be extremely compelling reasons for recognising syntactic words which are inaudible. There are some good candidates (such as the zero copula which exists in African-American English according to Labov 1969), but alternative analyses are always possible, so I don't know whether or not to recognise this possibility. [22 October 1999: yes: PRO]

G. (C) = S: Semantically empty words. These almost certainly exist, though the issue is still controversial. Expletive pronouns are strong candidates for semantic emptiness, but greetings and so on may also not have any meaning in terms of semantic structures as such.

H. C = (S): Implicit semantic elements. These are commonplace. Wherever a (syntactic) complement is optional, the semantic element that it would have expressed becomes implicit; e.g. shave always has two arguments (in the semantics), but the shave-ee may be implicit as in He shaved after cleaning his teeth.

In all the examples discussed so far we have considered numerical mismatches, where a single word on one level corresponds either to a sequence of more than one or to less than one element on another. The two remaining categories illustrate another possibility:

I. S/S = P: Single morphophonological words that qualify simultaneously as two different (and conflicting) syntactic words. For instance, in many languages participles are normal verbs in their syntactic valency, but are normal adjectives both in their distribution and in their inflectional features. The following Latin example illustrates the point: